GET MORE, PAY MORE? An elaborate test of construct validity of willingness to pay per QALY estimates obtained through contingent valuation

Ana Bobinac*, N. Job A. van Exel, Frans F.H. Rutten, Werner B.F. Brouwer

Department of Health Policy & Management and Institute for Medical Technology Assessment, Erasmus University Rotterdam, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands

Journal of Health Economics 31 (2012) 158–168. doi:10.1016/j.jhealeco.2011.09.004. © 2011 Elsevier B.V. All rights reserved.

Article history: Received 29 November 2010; received in revised form 24 August 2011; accepted 25 September 2011; available online 1 October 2011.

JEL classification: I10; I31

Keywords: Willingness to pay; WTP; Contingent valuation; Sensitivity to scale; Validity; QALY

* Corresponding author. Tel.: +31 10 40 88740; fax: +31 10 4089094. E-mail address: [email protected] (A. Bobinac).

Abstract

Estimates of WTP per QALY can be taken as an indication of the monetary value of health gains, which may carry information regarding the appropriate height of the cost-effectiveness threshold. Given the far-reaching consequences of choosing a particular threshold, and thus the potential relevance of WTP per QALY estimates, it is important to address the validity of these estimates. This study addresses this issue. Our findings offer little support to the validity of WTP per QALY estimates obtained in this study. Implications for general WTP per QALY estimates and further research are discussed.

1. Introduction

Economic evaluations inform allocation decisions in the healthcare sector by evaluating alternative interventions in terms of costs and benefits, typically expressed in non-monetary terms such as Quality-Adjusted Life-Years (QALYs) and summarized in an incremental cost-effectiveness ratio (ICER). Common decision rules indicate that an intervention is ‘good value for money’ if the ICER falls below the relevant cost-effectiveness threshold, which represents the relevant value of a health gain within a specific decision-making context. The nature and height of the threshold can vary with the normative rule adopted (Claxton et al., 2011). It can be viewed as representing opportunity costs of spending within the health care sector or as the (monetary) value society places on marginal health gains. If the second, perhaps more common, viewpoint is taken, the monetary value of the QALY can be empirically estimated with some preference elicitation method, the most prominent of which is contingent valuation (CV), i.e., willingness to pay (WTP), which has been applied several times in this context (e.g. Gyrd-Hansen, 2003; King et al., 2005; Bobinac et al., 2010). The important consequence of choosing a particular threshold, and thus the potential relevance of the WTP per QALY estimates, calls for addressing the issue of validity1 of WTP per QALY estimates. When such estimates are intended to inform decision making in the healthcare sector, they need to be robust. Issues of validity and reliability are thus of more than merely academic interest (Bateman and Brouwer, 2006).

1 Validity is about whether the measure reflects what it intends to (i.e. accuracy), as opposed to reliability, which deals with whether the instrument measures something other than random noise (i.e. reproducibility) (Jorgensen et al., 2004).

Broadly speaking, the validity of WTP estimates refers to whether the estimates concur with the underlying economic theory, i.e. the neoclassical theory of consumer behavior which predicts that larger gains result in higher WTP, ceteris paribus. Validity can thus be judged by considering the robustness of WTP to changes in the QALY gains offered (for instance, by varying the size of the quality improvement or the duration of the health gains). While theory predicts that WTP should increase with increasing QALY gains, it does not predict the exact size of the increase (Fisher, 1996; Bateman and Brouwer, 2006). The relationship between WTP and QALY gains is expected to be increasing yet concave, such that an increase in the QALY gain offered yields a less than proportional




increase in WTP (Bradford, 1972; Smith, 2005; Olsen et al., 2004). This is mainly due to diminishing marginal utility of health and the income effect, i.e., an increasing WTP takes a higher proportion of income and thus decreases the ability to pay (Flores and Carson, 1997; Smith, 2005).

WTP estimates have been criticized for their insensitivity to scale, implying that they do not vary ‘meaningfully’ with the quantity of the offered good. Given that a non-proportional increase is expected theoretically, the degree of robustness must also be evaluated to allow more general claims about the validity of WTP estimates. In other words, finding a significant (and positive) coefficient of the health gain (or income) in a linear regression explaining the variance in WTP, usually termed ‘theoretical validity’ (e.g. Ryan, 2004; Lienhoop and MacMillan, 2007), is a necessary but not sufficient condition for a more general claim about validity. Indeed, the appropriate sign does not necessarily imply that the associated variation is ‘practically meaningful’ or, as it also has been labeled, ‘theoretically plausible’ (Olsen et al., 2004), let alone that the estimates can be directly applied in decision making. Results that cannot be shown to be theoretically invalid may thus still be considered practically irrelevant. Judging whether results are practically meaningful, i.e., whether the size of the coefficient is deemed appropriate, requires a judgment that is normative and not directly informed by theory (Hammitt and Graham, 1999). This issue has yet to receive due attention in the literature. It has, for instance, been suggested that WTP estimates for small risk reductions need to be near-proportional (increasing and strictly concave) to the size of the risk reduction (NOAA, 1993; Hammitt and Graham, 1999). Although perhaps somewhat restrictive, the condition of near-proportionality might thus be appropriate in establishing what practically meaningful (i.e. ‘theoretically plausible’) refers to. We will use this benchmark here as well. Still, exactly when something is considered near-proportional is again somewhat arbitrary.

Although the validity of WTP estimates for goods other than health has attracted considerable attention (e.g. Desvousges et al., 1993; McFadden and Leonard, 1993; Jones-Lee et al., 1995; Carson and Mitchell, 1995; Frederick and Fischhoff, 1998; Hammitt and Graham, 1999; Smith, 2001; Van Exel et al., 2006; Van Houtven et al., 2006; Smith and Sach, 2010; Baker et al., 2010), in-depth empirical interest in the validity of WTP for changes in health per se is limited (Olsen et al., 2004; Smith, 2005; Yeung et al., 2003), particularly when health is expressed in terms of QALYs. (A notable exception is Pinto-Prades et al., 2009.) Validity, in that sense, is also not thoroughly addressed in studies reporting WTP per QALY estimates (e.g. Shiroiwa et al., 2010; Donaldson et al., 2011). Given the commonness of using QALYs as a measure of health gains and the increased interest in the monetary value of QALYs, however, such studies appear warranted.

Our study contributes to the literature by extensively exploring the validity of WTP per QALY estimates, using a data set explicitly designed to (1) estimate the WTP per QALY and (2) test the various aspects of this estimate’s validity. This data set was previously used by Bobinac et al. (2010) for the purpose of reporting and discussing average WTP per QALY estimates, while this paper reports on testing the validity of the WTP per QALY estimates through examining the relationship between the WTP estimates and the QALY gains. We define validity in terms of ‘construct validity’ (Jakobsson and Dragun, 1996), which encompasses scale sensitivity of WTP2 and the related sub-additive impartiality (or ‘part-whole’ bias). The latter refers to the fact that the WTP for the same quantity of some good is typically less when this quantity is offered as a whole, and more when it is offered in separately valued parts. Our study design allows validity testing along both dimensions of a QALY (length and quality of life), within and between-blocks of data, and on aggregate and sub-group levels, thus allowing us to account for the underlying heterogeneity in preferences. The specific hypotheses are described in Section 2 and tested in Section 3. The implications of the results and the application of WTP studies in estimating the monetary value of QALY gains in the context of determining an appropriate cost-effectiveness threshold are discussed in Section 4.

2 Since the QALY is the outcome of interest, we will not address the sensitivity to scope in this paper (i.e., sensitivity to a range of goods on offer).

2. The test

Our study uses a contingent valuation (CV) data set from a representative sample of the Dutch population aged 18 to 65, designed to estimate the WTP for a QALY from the individual perspective under certainty and to test the construct validity of WTP per QALY estimates. (See Bobinac et al., 2010 for more details on data collection and results.) Respondents were recruited by a professional Internet sampling company and the questionnaire was administered online. Completion was rewarded by a small sum donated to a charity of choice.

The construct validity of WTP per QALY estimates was tested using the following hypotheses:

Hypothesis I. WTP is sensitive to scale in terms of quality of life. That is, for a given duration, a larger gain in quality of life should result in a higher WTP, c.p., both between and within-blocks. The sensitivity will be evaluated in terms of theoretical validity and practical meaningfulness.

Hypothesis II. WTP is sensitive to scale in terms of duration. Within-blocks, for a given gain in quality of life, a longer duration of the gain should result in a higher WTP. The sensitivity will be evaluated in terms of theoretical validity and practical meaningfulness.

Hypothesis III. Subgroup-level data exhibits an increased level of sensitivity, relative to average-level data. That is, in specific subgroups WTP is more sensitive to changes in the size of the offered gain, both between and within-blocks.

Hypothesis IV. WTP per QALY estimates are affected by the sub-additivity bias. Within-blocks, the sum of the values attached to two smaller QALY gains is expected to exceed that attached to the sum of these gains when valued jointly.

A significant difference between the WTP for smaller and larger gains is a necessary condition for establishing construct validity, but it need not be a sufficient one. A more definite test would be disproving the sub-additivity bias, i.e., finding a ‘near-proportional’ relationship between WTP estimates and the size of the health gains on offer, thus establishing a (near) additive relationship between them.
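The difference between a merely significant increase and a near-proportional one can be made concrete with a small sketch. The function name, the WTP amounts and the use of a simple ratio of proportional changes are illustrative assumptions and are not taken from the study's analysis.

```python
def scale_ratio(wtp_small, wtp_large, gain_small, gain_large):
    """Proportional change in WTP divided by proportional change in QALY gain.

    A value near 1 would suggest near-proportionality, a value near 0
    insensitivity to scale. Where to draw the line is a normative choice.
    """
    if wtp_small <= 0 or gain_small <= 0:
        raise ValueError("baseline WTP and baseline gain must be positive")
    wtp_change = (wtp_large - wtp_small) / wtp_small
    gain_change = (gain_large - gain_small) / gain_small
    if gain_change == 0:
        raise ValueError("the two gains must differ")
    return wtp_change / gain_change


# Hypothetical mean WTP values (not from the paper) for a QALY gain that doubles.
ratio = scale_ratio(wtp_small=1000.0, wtp_large=1300.0,
                    gain_small=0.369, gain_large=0.738)
print(round(ratio, 3))
```

In this made-up case WTP rises by 30% while the gain rises by 100%, so the ratio is well below 1: the sign is right, but the increase is far from proportional.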

2.1. Survey instrument

42 health states, described using the EuroQoL-5D system (or EQ-5D; EuroQol Group, 1999), were paired into 29 scenarios. Many of the selected health states were originally applied in deriving the national tariffs for the EQ-5D (Kind et al., 1998; Lamers et al., 2006) or applied by Gyrd-Hansen (2003) to estimate the WTP per QALY in Denmark. The 29 scenarios, representing a fair spread of QALY gains across the utility range (Fig. 1), were (with some overlap) assigned to 10 different blocks of six scenarios. Respondents solved one randomly assigned block. Four out of six scenarios from each block are relevant for this study as they were specifically designed and combined to test the construct validity of WTP per QALY estimates. In particular, the first two scenarios in each block represented smaller health gains that, according to Dutch EQ-5D tariffs (Lamers et al., 2006), added up to a larger health gain presented in the third scenario (Table 1). Health gains in different scenarios purposefully started either low or in the middle of the QALY scale, ending either in the middle or high on the scale. To avoid specific dimensions of the EQ-5D dominating our results, we varied the dimensions that constitute a quality of life gain within scenarios.3 In each block, one scenario was repeated as the fourth scenario, but now with a longer duration (i.e., 3 or 5 years instead of 1 year; see the right-hand side of Table 1). The combinations of health states and durations were chosen to ensure comparability across blocks.

The two remaining scenarios were not purposefully designed to test construct validity but for the calculation of individual WTP per QALY under certainty.

Table 1
Design of scenarios.

Scenarios 1–3 (first seven columns); Scenario 4: pairing and duration in blocks (last column)

Scenario  HS1    HS2    HS1 (tariff)  HS2 (tariff)  QALY gain  Duration in years  Scenario 4 duration in years
Blocks 1–4
1         21312  12111  0.478         0.847         0.369      1                  3, 5**
2         22323  21312  0.109         0.478         0.369      1                  3*, 5
3         22323  12111  0.109         0.847         0.738      1
Blocks 5–7
1         12311  11211  0.556         0.897         0.341      1                  3
2         32311  12311  0.395         0.556         0.161      1                  3, 5
3         32311  11211  0.395         0.897         0.502      1
Blocks 8–10
1         11312  11211  0.514         0.897         0.383      1                  3
2         11332  11312  0.185         0.514         0.329      1                  3, 5
3         11332  11211  0.185         0.897         0.712      1

[The original table places each scenario 4 entry under one of the block columns 1–10; those column positions are not preserved in this transcript.]

* The “3” indicates that respondents in block 1 (see column in table), in addition to the three scenarios listed in columns 2–4 of the table, evaluated a fourth scenario, which was identical to scenario 2 (see row in table) in terms of the health states presented but differed in terms of duration (i.e. 3 years instead of 1 year).
** The “5” indicates that respondents in block 4 (see column in table), in addition to the three scenarios listed in columns 2–4 of the table, evaluated a fourth scenario, which was identical to scenario 1 (see row in table) in terms of the health states presented but differed in terms of duration (i.e. 5 years instead of 1 year).

Fig. 1. Spread of gains across the utility plain.

Respondent-specific health state valuations were obtained from the visual analog scale (VAS), as these represent the gains that were actually valued, in two steps.4 First, respondents rated their current health, death, and perfect health on the VAS bounded by end-points labeled “best imaginable health” and “worst imaginable health”. This allowed for health states worse than death. Second, respondents indicated which of two presented health states in each scenario was better and then rated the two states on the VAS showing the previous valuations of current health, death and perfect health, thus providing a valuation context. Respondents were instructed to imagine being in the better health state and to indicate their WTP to avoid 1 year in the health state they chose as worse. The health loss, i.e., the difference between the two health states, could be avoided by taking a painless medicine for which one had to pay out-of-pocket in 12 monthly installments (avoiding the need to correct for discounting).

WTP was elicited in a two-step procedure: a payment scale (PS) (Donaldson et al., 1995; Olsen and Donaldson, 1998), followed by a bounded ‘open-ended’ (OE) question. Respondents were first presented with an ordered low-to-high payment scale of monthly installments5 and asked to indicate the maximum amount they would certainly pay and the minimum amount they would certainly not pay (Donaldson et al., 1997). Together, the two answers provided a range of values for which people were uncertain (Dubourg et al., 1997). Secondly, respondents were given a bounded OE follow-up question, asking them to indicate the maximum amount they would pay if asked to do so right now, within the boundaries they had indicated in the first step. This two-step approach was applied to arrive at a more precise and robust estimate of the maximum WTP. A combination of WTP elicitation methods, although in different settings, was applied before (e.g. Bhatia and Fox-Rushby, 2003; Cameron and Quiggin, 1994; Johnson et al., 2000).

For the purpose of reducing the hypothetical bias inherent to CV exercises (Blomquist et al., 2009), respondents were, ex ante, reminded to take their net monthly household income into consideration when solving the exercise (NOAA, 1993). The image of the health states valued on the VAS remained present on the screen as a reminder of the size of the gain being valued. Ex post, respondents were asked which part of household spending they would economize on to pay for the health gain6 (NOAA, 1993) and to indicate how sure they were about their stated WTP.7 Finally, if respondents chose €0 as their maximum WTP, they were asked to indicate an explanation.8

3 I.e., in the first four blocks quality of life changed between the two health states along all the EQ-5D dimensions; in blocks 5–7 quality of life gains were achieved in the mobility, self-care and daily activities segments and in blocks 8–10 the gain is achieved in the “mental” dimensions of pain and anxiety.

4 VAS was used instead of a TTO or SG because it is easier to use, especially in the context of self-completion, which was deemed an important consideration in a large and complex, web-based questionnaire such as this one.

5 Monthly installments were: €0; 10; 15; 25; 50; 75; 100; 125; 150; 250; 300; 500; 750; 1000; 1500; and 2500.

The questionnaire was pilot-tested in a sample of 100 respondents to determine the plausibility and clarity of the tasks, the feasibility of the questionnaire, and test the range of the payment scale. Respondents could express their opinion about the tasks at hand but none of the comments pointed to the task being too difficult or unrealistic. Combined with a low dropout rate and reasonable completion time, it was judged that respondents were capable of understanding and solving the tasks at hand. The pilot, however, showed that the payment scale was not optimal since the values above €2500 were never chosen. To avoid the loss of information and possible anchoring to exaggeratedly high values, the maximum was lowered to €2500 and amounts added around the most frequent values.

Scenarios were presented in random order to optimally control for possible order bias (e.g. Bateman and Jones, 2003). Following the questionnaire, respondents were asked about their socio-economic and demographic characteristics.

2.2. Analyses of hypotheses

The sample-specific utility weights were calculated in a rescaling procedure intended to correct for the VAS end-points not being labeled as “death” and “perfect health” but “best imaginable health” and “worst imaginable health”, based on the formula:

$$X_{\text{RESCALED}} = \frac{X_{\text{RAW}} - X_{\text{MEAN of DEATH}}}{X_{\text{MEAN of PERFECT HEALTH}} - X_{\text{MEAN of DEATH}}} \qquad (1)$$
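As a minimal sketch of Eq. (1), the rescaling could be written as below. The function name and the anchor ratings are hypothetical and only illustrate the arithmetic; the study's own anchor values are reported in its results section.

```python
from statistics import mean


def rescale_vas(x_raw, death_scores, perfect_health_scores):
    """Eq. (1): map a raw VAS score so that the sample mean for 'death'
    becomes 0 and the sample mean for 'perfect health' becomes 1."""
    x_death = mean(death_scores)
    x_perfect = mean(perfect_health_scores)
    if x_perfect == x_death:
        raise ValueError("the two anchor means must differ")
    return (x_raw - x_death) / (x_perfect - x_death)


# Hypothetical anchor ratings on a 0-100 VAS (illustrative only).
death = [0, 5, 10]        # mean 5
perfect = [90, 95, 100]   # mean 95
rescaled = rescale_vas(50, death, perfect)  # (50 - 5) / (95 - 5)
print(rescaled)
```

A raw score below the mean rating for death maps to a negative weight, which is how states worse than death are accommodated.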

The QALY gains were calculated as the difference between the VAS and EQ-5D weights of the two health states, respectively. The average, point estimates of the WTP per QALY were calculated as an average of the ratios between the open-ended (OE) WTP answer and the QALY gain, for every data row, using the sample-specific VAS scores. Discount rates for health were obtained by asking respondents about their indifference between 10 days of illness next month and another period of illness in 3 or 5 years’ time.
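The point estimate described here is a mean of per-row ratios rather than a ratio of means. A minimal sketch, with made-up data and a hypothetical function name, could look as follows; skipping rows with a non-positive gain is an assumption of the sketch, not a rule stated in the paper.

```python
from statistics import mean


def mean_wtp_per_qaly(rows):
    """Average of per-row ratios WTP / QALY gain.

    `rows` holds (wtp, qaly_gain) pairs. Rows with a non-positive gain
    are skipped here, which is an assumption made for this sketch.
    """
    ratios = [wtp / gain for wtp, gain in rows if gain > 0]
    if not ratios:
        raise ValueError("no rows with a positive QALY gain")
    return mean(ratios)


# Hypothetical (annual WTP in euros, QALY gain) pairs, for illustration only.
rows = [(600.0, 0.2), (1200.0, 0.3), (500.0, 0.5)]
estimate = mean_wtp_per_qaly(rows)  # mean of roughly 3000, 4000 and 1000
print(round(estimate, 2))
```

Because each row enters with equal weight, rows with very small gains can dominate a mean of ratios, which is one reason the two aggregation choices can give different answers.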

The distributional properties of WTP estimates were analyzed using Kurtosis and Shapiro–Wilk tests for normality. The hypotheses were tested using the parametric t-test on log-transformed and non-transformed WTP estimates, and the non-parametric Mann–Whitney U-test (Yeung et al., 2003). Particular attention was paid to testing the income variable and its effect on the WTP per QALY estimates. Due to multiple observations per respondent, all tests were repeated to check for clustering effects. Statistical analyses were performed in STATA11.
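The analyses themselves were run in STATA. Purely as an illustration of the statistics named above, the sketch below computes a pooled-variance two-sample t statistic (on raw and log-transformed values) and the Mann–Whitney U statistic by direct pair counting. The data are invented, p-values are omitted, and the use of log(1 + x) to keep zero answers defined is an assumption of this sketch rather than the paper's stated transformation.

```python
import math
from statistics import mean, variance


def t_statistic(a, b):
    """Two-sample t statistic with pooled variance (equal variances assumed)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def mann_whitney_u(a, b):
    """U statistic for sample a: pairs (x in a, y in b) with x > y, ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


def log_transform(values):
    """log(1 + x), so that zero WTP answers remain defined (sketch assumption)."""
    return [math.log1p(v) for v in values]


# Hypothetical monthly WTP answers for a smaller and a larger health gain.
small_gain = [10, 25, 50, 50, 100]
large_gain = [25, 50, 75, 150, 250]

t_raw = t_statistic(large_gain, small_gain)
t_log = t_statistic(log_transform(large_gain), log_transform(small_gain))
u = mann_whitney_u(large_gain, small_gain)
print(t_raw, t_log, u)
```

With skewed WTP data the raw and log-scale t statistics can disagree, which is one reason for reporting both alongside a rank-based test.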

Table 2 summarizes the hypotheses and their related tests. For all hypothesis tests, the change in WTP was evaluated in terms of its sign, size and practical meaningfulness (i.e. theoretical plausibility), relative to the change in the size of the health gain. In particular, sensitivity to scale was examined in terms of quality of life (Hypothesis I), both between and within blocks. Between blocks, the statistical differences between WTP estimates for different gains were tested under the premise that different samples elicit similar WTP for similar gains and statistically different WTP for gains of different sizes (e.g. comparing scenarios 1 across blocks; Table 1). The within-block tests were performed to reveal whether respondents, when faced with consecutive health gains varying in size, assigned a statistically significantly (and meaningfully) higher WTP to higher gains. In terms of Table 1, we tested whether, for example, scenarios 1 and 2 yielded lower and statistically different WTP estimates than those obtained in scenario 3. Because the scenarios were presented consecutively to respondents, potentially drawing attention to the size of the health gain, one could expect the within-blocks tests to be more likely to detect a (meaningful) increase in WTP estimates (given the increase in health gains) (Kahneman et al., 1999; Hammitt and Graham, 1999; Olsen et al., 2004).

6 Answer options were (i) food; (ii) clothing; (iii) entertainment; (iv) sports; (v) savings; (vi) charity and (vii) other (Smith, 2006).

7 Answer options were (i) totally sure; (ii) pretty sure; (iii) maybe yes, maybe no; (iv) probably not and (v) surely not.

8 Answer options were: (i) I am unable to pay more than €0; (ii) avoiding the worse health state and remaining in the better health state is not worth more than €0 to me; (iii) I am not willing to pay out of ethical considerations and (iv) other (with an open text field for explanation). Options (i) and (ii) were considered “true zero” WTPs, whereas for options (iii) and (iv) this is less clear.

Sensitivity to scale in terms of duration (Hypothesis II) was examined within blocks. Health gains of longer duration were expected to be valued statistically significantly (and meaningfully) higher (Table 1).

Sensitivity to scale was further examined in subgroups of respondents with different (i) levels of net monthly household income9 and (ii) reported levels of certainty in the WTP answers (Johannesson et al., 1999), thus addressing Hypothesis III. Subgroup analysis was performed using both within-blocks and between-blocks tests. With respect to income, the sensitivity to scale was expected to increase with the absolute level of income but decrease with the proportion of income sacrificed, regardless of its absolute level, thus disclosing the “income effect”. In terms of expressed certainty, we expected that higher levels of certainty would be associated with more sensitivity due to more reasoned responses.

Sub-additivity (Hypothesis IV) was examined by comparing WTP for the health gain in scenario 3 of each block with the sum of WTP estimates for the health gains in scenarios 1 and 2 (Table 1). (Recall that the two smaller gains added up to the larger gain in terms of Dutch EQ-5D tariffs.) We expected the sum of the values assigned to two smaller gains (i.e., the first two scenarios) to exceed the value assigned to the summed gain in scenario 3.10
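The sub-additivity comparison can be sketched per respondent as follows. The category labels follow the paper's footnote on sub-additivity categories; the function name, the tolerance parameter and the example amounts are illustrative assumptions.

```python
def subadditivity_category(wtp_part1, wtp_part2, wtp_whole, tol=0.0):
    """Classify one respondent by comparing the summed WTP for the two
    smaller gains (scenarios 1 and 2) with the WTP for the joint gain
    (scenario 3). `tol` is a hypothetical tolerance for treating small
    differences as equality."""
    diff = (wtp_part1 + wtp_part2) - wtp_whole
    if diff > tol:
        return "positive"   # parts valued above the whole (sub-additivity)
    if diff < -tol:
        return "negative"   # parts valued below the whole
    return "neutral"        # parts and whole valued equally


# Hypothetical monthly WTP answers (scenario 1, scenario 2, scenario 3).
answers = [(50, 75, 100), (25, 25, 100), (50, 50, 100)]
categories = [subadditivity_category(*a) for a in answers]
print(categories)
```

Under Hypothesis IV most respondents would be expected to fall into the "positive" category.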

Validity was further explored with multivariate regressions, using the QALY gain as the independent variable, along with the most common socio-economic factors. WTP per QALY was not used as the dependent variable since the focus here is on the determinants of the variance in the WTP, especially those stemming from the changes in the size of the QALY gain. Finally, the results were tested for specific framing effects: order bias was tested by considering the strength of correlation between the mean WTP estimates across all scenarios and the WTP assigned in the first scenario presented to a respondent; learning bias was tested by inspecting improvement in sensitivity to scale in scenarios presented later in the questionnaire, i.e., after the respondents had gathered some experience answering questions.
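The order-bias check rests on a correlation between each respondent's first-scenario WTP and their mean WTP across scenarios. The paper does not state which correlation coefficient was used; the sketch below assumes Pearson's r, and the data and names are hypothetical.

```python
import math
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical per-respondent values: WTP in the first scenario shown,
# and mean WTP over all scenarios that respondent answered.
first_wtp = [10, 50, 100, 250]
mean_wtp = [20, 40, 120, 200]
r = pearson_r(first_wtp, mean_wtp)
print(round(r, 3))
```

A very strong correlation would be consistent with respondents anchoring later answers on their first one, although it could equally reflect stable individual differences in ability or willingness to pay.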

9 Income groups were defined by the national income distribution, such that the poorest (household income <€1000 per month) and the wealthiest (>€3500 per month) groups comprised 13% and 12% of the sample, respectively, while two middle groups comprised 35% and 40% of the sample.

10 In additional analyses, the individuals were assigned to one of three sub-additivity categories: positive “scope” (the sum of the values assigned to the parts exceeded the value assigned to the whole), negative “scope” (the sum of the values assigned to the parts was lower than the value assigned to the whole), and neutral “scope” (the sum of the values assigned to the parts equaled the value assigned to the whole).


Table 2. A summary of tests and hypotheses.

Hypothesis I. WTP is sensitive to scale in terms of quality of life
Main features: Quality of life varies, duration constant (1 year)
A priori expectations:
• Larger quality of life gains should result in higher WTP
• Between-blocks: different sub-samples solving different blocks elicit similar and not statistically different WTP for similar gains, and larger and statistically different WTP for gains of larger size in terms of quality of life
• Within-blocks: respondents solving a single block offering consecutive health gains varying in size assign a higher and statistically different WTP to the higher gain in terms of quality of life
Tests: The differences in WTP estimates tested using parametric t-test on WTP data, parametric t-test on log-transformed WTP data and the non-parametric Mann–Whitney U-test, applied within and between-blocks

Hypothesis II. WTP is sensitive to scale in terms of duration
Main features: Quality of life constant, duration varies (3 or 5 years)
A priori expectations:
• Longer duration of gains should result in higher WTP
• Within-blocks: respondents, solving a single block offering consecutive health gains varying in size, assign a higher and statistically different WTP to gains with longer duration
Tests: The differences in WTP estimates tested using parametric t-test on WTP data, parametric t-test on log-transformed WTP data and the non-parametric Mann–Whitney U-test, applied between-blocks

Hypothesis III. Subgroup-level data exhibits increased level of sensitivity, relative to average-level data
Main features: Main features equal to those of Hypotheses I and II. Subgroups: (i) levels of household income and (ii) levels of certainty in the WTP answers
A priori expectations:
• In line with expectations of Hypotheses I and II
Tests: Repeat the tests of Hypotheses I and II on subgroup level data

Hypothesis IV. WTP estimates are affected by the sub-additivity bias
Main features: Quality of life varies, duration constant (1 year)
A priori expectations:
• The sum of the WTP for two smaller QALY gains exceeds the WTP for the sum of those gains
• Within-blocks: respondents, solving a single block offering consecutively two smaller health gains and one health gain equal to the sum of the smaller health gains, assign a proportionally smaller WTP to the smaller gains than to the larger gain, and the WTP estimates for the smaller gains add up to the WTP of the larger gain
Tests: The differences in WTP estimates tested using parametric t-test on WTP data, parametric t-test on log-transformed WTP data and the non-parametric Mann–Whitney U-test, applied within-blocks
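The three tests listed in Table 2 can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' analysis code: the sample values are hypothetical, the unequal-variance (Welch) form of the t statistic is an assumption (the paper says only "parametric t-test"), and the Mann–Whitney p-value uses a normal approximation without tie correction.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t(x, y):
    """Welch's t statistic for two independent samples (assumed variant)."""
    se2 = variance(x) / len(x) + variance(y) / len(y)
    return (mean(x) - mean(y)) / math.sqrt(se2)


def mann_whitney_u(x, y):
    """U statistic for x versus y and a two-sided p-value.

    The p-value uses the normal approximation and ignores tie corrections,
    so it is only a rough guide for small or heavily tied samples.
    """
    u = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    n1, n2 = len(x), len(y)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, p


# Hypothetical monthly WTP values (in euros) for a smaller and a larger gain.
wtp_small = [50, 75, 75, 100, 150, 400]
wtp_large = [60, 75, 100, 120, 200, 900]

t_raw = welch_t(wtp_large, wtp_small)
# Log transform for skewed data; zero bids would need separate handling.
t_log = welch_t([math.log(w) for w in wtp_large],
                [math.log(w) for w in wtp_small])
u_stat, p_value = mann_whitney_u(wtp_large, wtp_small)
```

Running the t statistic on both raw and log-transformed values mirrors the table: the log version is less dominated by a few very high bids.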

3. Results

1091 respondents representative of the Dutch population in age (from 18 to 65 years of age), gender, and education participated in the survey (Table 2). On average, 2.44 people shared an average net monthly household income of €2564, adequately representing the Dutch national figures for 2008 (CBS, 2009) (Table 3).

Most respondents exhibited a positive WTP for health gains. 62 respondents indicated, in one or more scenarios, that they would not pay more than €0 for a health gain (only 23 respondents indicated €0 in all 5 scenarios).11 Given the small number of zeros, further analysis was performed including “zero value” respondents.

11 Explanations for not paying more than zero were fairly equally distributed between the four possibilities on offer (i.e. around 25% each of the four explanations). A difficult issue is always how to interpret zero answers, even in light of the provided explanations. We could not investigate this further, but it may well be that these respondents were simply not prepared to express their health gain preferences along the chosen valuation instrument. Given the small amount of zeros and the negligible influence on results, and the difficulty in labeling zeros as true ‘protest answers’, we decided to include them in further analyses. It seems, however, that more research in this area is warranted.

The sample-specific VAS scores were rescaled on the mean and median scores for perfect health and death (90 and 0, and 84 and 15.4, respectively). Since the mean scores exhibited larger variation in estimates and fit the EQ-5D tariffs better than median scores (i.e. were more similar), only these are presented henceforth.

Overall, QALY gains based on sample-specific VAS scores were somewhat lower than those based on Dutch EQ-5D tariffs, with one notable exception (scenario 2 in blocks 5–7, Table 4). In most scenarios one health state was unambiguously better than the other. We tested and confirmed that the better health states systematically received higher average valuations on the VAS; respondents reversed the ranking in fewer than 5% of scenarios. However, the correlation between EQ-5D tariffs and sample-specific VAS scores was relatively low (r = 0.24, p = 0.02). Although the average ratio between QALY gains based either on existing tariffs or the VAS scores was 0.97, the dispersion of estimates around the ratio of 1 is considerable (Table 4). Since the VAS QALY gains represent the estimates that respondents themselves provided and subsequently valued through the WTP exercise, these estimates will be used henceforth.

The results of tests for sensitivity to scale in terms of quality of life (Hypothesis I) are presented in Table 4. As could be expected, the WTP data was skewed and thus the parametric t-test on log-transformed data and the non-parametric Mann–Whitney U-tests were applied. With respect to within-block analysis, the parametric and non-parametric tests revealed no statistical difference between mean WTP estimates for gains of comparable size

Table 3. Summary statistics (n = 1091).

| Variable | Mean | sd | Min | Max |
|---|---|---|---|---|
| Age | 42.1 | 12.1 | 18 | 65 |
| Gender (% female) | 0.53 | 0.50 | | |
| Marital status (%) | | | | |
|   Married | 0.61 | 0.49 | | |
|   Divorced | 0.10 | 0.31 | | |
|   Single | 0.24 | 0.43 | | |
|   Widowed | 0.03 | 0.16 | | |
|   Unknown | 0.02 | 0.14 | | |
| Children (% yes) | 0.56 | 0.50 | | |
|   Number (n = 3070) | 2.23 | 10.1 | 1 | 10 |
| Household monthly income (mean/median €) | 2564/2499 | | | |
| Household monthly income groups (%) | | | | |
|   Group 1 (<1000 €) | 0.13 | 0.33 | | |
|   Group 2 (>999 and <2000 €) | 0.34 | 0.48 | | |
|   Group 3 (>1999 and <3500 €) | 0.40 | 0.49 | | |
|   Group 4 (>3499 €) | 0.12 | 0.33 | | |
| Number of people living on household income | 2.44 | 10.4 | 1 | 20 |
| University education (%) | 0.36 | 0.48 | | |
| Employment status (%) | | | | |
|   Employed | 0.62 | 0.48 | | |
|   Unemployed | 0.17 | 0.38 | | |
|   Student | 0.06 | 0.25 | | |
|   Housewife/husband or retired | 0.14 | 0.35 | | |
| Health status | | | | |
|   EQ-5D (Dutch tariff) | 0.84 | 0.22 | −0.26 | 1.00 |
|   EQ-VAS | 78.5 | 170.1 | 0 | 100 |
| Suffering a chronic illness (%) | 0.39 | 0.94 | | |
| Subjective life expectancy | 81.9 | 11.2 | 30 | 120 |
| Completion time of the questionnaire | 18.8 | 60.13 | 9 | 61 |
| Levels of certainty (%) | | | | |
|   Totally sure | 14.4 | | | |
|   Pretty sure | 41.7 | | | |
|   Maybe yes, maybe no | 32.9 | | | |
|   Probably not | 8.0 | | | |
|   Surely not | 3.0 | | | |

in scenarios 1 and 2 in all blocks. Although there is no statistically significant difference between mean WTP estimates, the gain situated lower on the scale systematically received a higher WTP relative to similarly sized gains positioned higher on the scale. This may signal that a given health gain is considered more valuable when attained low on the QALY scale.

Table 4. Results of the scale sensitivity test: scenarios with health gains of different size and equal duration. WTP per QALY estimates rounded to hundreds (all monetary values expressed in €).

| Blocks | Scenario | n | HS1 (tariff) | HS2 (tariff) | QALY gain (tariff) | VAS (rescaled) HS1 | VAS (rescaled) HS2 | VAS QALY gain* | Ratio** | WTP (month), mean (sd) | WTP (month), median | WTP per QALY, mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1–4 | 1 | 444 | 0.478 | 0.847 | 0.369 | 0.412 | 0.719 | 0.348 | 0.94 | 150 (319) | 75 | 8300 |
| 1–4 | 2 | 444 | 0.109 | 0.478 | 0.369 | 0.305 | 0.530 | 0.268 | 0.73 | 170 (349) | 75 | 16,200 |
| 1–4 | 3 | 444 | 0.109 | 0.847 | 0.738 | 0.303 | 0.737 | 0.476 | 0.64 | 174 (358) | 75 | 6800 |
| 5–7 | 1 | 329 | 0.556 | 0.897 | 0.341 | 0.402 | 0.728 | 0.337 | 0.99 | 167 (318) | 100 | 13,600 |
| 5–7 | 2 | 329 | 0.395 | 0.556 | 0.161 | 0.262 | 0.570 | 0.340 | 2.11 | 178 (320) | 100 | 12,000 |
| 5–7 | 3 | 329 | 0.395 | 0.897 | 0.502 | 0.256 | 0.743 | 0.496 | 0.99 | 197 (366) | 100 | 7600 |
| 8–10 | 1 | 318 | 0.514 | 0.897 | 0.383 | 0.465 | 0.748 | 0.310 | 0.81 | 167 (353) | 75 | 11,900 |
| 8–10 | 2 | 318 | 0.185 | 0.514 | 0.329 | 0.323 | 0.570 | 0.294 | 0.89 | 196 (360) | 100 | 15,200 |
| 8–10 | 3 | 318 | 0.185 | 0.897 | 0.712 | 0.319 | 0.719 | 0.442 | 0.62 | 203 (385) | 100 | 9000 |

* Gains were calculated for each individual and then averaged; therefore the difference between two gains on average is not equal to the average gain presented in this column.
** Ratio of QALY gain estimated using VAS to QALY gain by tariff.

When testing sensitivity to scale by comparing the valuations in scenarios 1 and 2 with those in scenario 3 (representing a considerably higher gain), a (marginally) statistically significant difference between WTP for smaller and larger gains was only observed in blocks 8–10 (p = 0.1 for scenario 1 vs. 3 using the parametric test). In terms of practically meaningful (or ‘theoretically plausible’) increases in WTP, however, it appears implausibly low when compared to the increase in the health gain on offer. Indeed, an increase in the health gain of 50% (from 0.310 in scenario 1 to 0.442 in scenario 3) resulted in an increase in WTP of no more than €7 per month (+3.6%). In other blocks, no significant differences between the values in scenarios 1 and 2 and those in scenario 3 were detected.

We explored whether the differences between the valuations obtained in different scenarios would be more pronounced on the subgroup level (using within-block tests in subgroups). The


results indicated that “highly certain” respondents exhibited only a marginally higher level of sensitivity to scale: a statistically significant difference was observed between scenario 1 and scenarios 2 and 3 in blocks 8–10 (p = 0.02 and 0.01, respectively) and between scenarios 1 and 3 in blocks 5–7 (p = 0.05). These respondents were on average younger (p = 0.00), in better health (p = 0.00), more often employed (p = 0.00), and devoted more time to the questionnaire (p = 0.00). No difference in sensitivity to scale was detected between respondents belonging to different income groups, an issue further explored below.

Between-block tests revealed no statistical differences in WTP for similar gains in scenarios 1 and 2 across blocks (Table 4). For gains of 0.268–0.348, the WTP ranged from €150 to €167 per month. We can interpret this result in different ways. First, it might be encouraging that, when presented with gains of similar size, different samples elicit similar WTP estimates. On the other hand, it may be that a fluctuation of €17 per month between groups compared to a fluctuation in health gain of 0.08 QALY indicates an insensitivity to scale. (The difference between the highest and lowest health gains was 29.9%, with a corresponding increase in WTP of 11.3%.) However, a significant difference in WTP estimates for health gains in scenario 3 was found across blocks, although these gains were also comparable in size. This result prompted additional investigation, since it might be caused by income constraints, given that the gains in scenario 3 were relatively large. However, we found that the majority of WTP-to-income ratios were fairly low (mean = 7.4%, median = 3.6%). We also found that respondents who elicited bids corresponding to an above average WTP-to-income ratio (and thus more closely approached the income constraint) did not exhibit lower variability in WTP estimates than other respondents (Smith, 2005). Budget constraints are thus unlikely to explain such findings.12
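The percentages quoted in parentheses above can be reproduced directly from the reported range of gains and WTP values. The short sketch below is illustrative only; the helper name `pct_change` is hypothetical, and the final ratio (WTP change per unit of gain change) is an added summary measure, not one reported by the authors.

```python
def pct_change(low, high):
    """Percentage increase from low to high."""
    return 100.0 * (high - low) / low


# Between-block range for scenarios 1 and 2, as quoted in the text.
gain_change = pct_change(0.268, 0.348)  # change in VAS-based QALY gain
wtp_change = pct_change(150, 167)       # change in mean monthly WTP (euros)

# Under strict proportionality this ratio would be 1.
responsiveness = wtp_change / gain_change

print(round(gain_change, 1), round(wtp_change, 1), round(responsiveness, 2))
```

A ratio well below 1 is what the text describes as insensitivity to scale: WTP rises, but far less than the gain does.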

Table 5 shows the sensitivity with respect to duration (Hypothesis II). The mean WTP values somewhat increased when the duration increased but the increase was mostly statistically insignificant (p = 0.09 or higher), both when using non-parametric or parametric tests and in both subgroups. In fact, median values were almost identical between smaller and larger gains, emphasizing the degree of insensitivity. Seemingly, respondents were able to assign similar values to similarly sized gains in two different scenarios (even though they were not specifically informed about the equality of health states) but failed to assign significantly higher values to benefits that lasted longer. For instance, in block 1 respondents valued a health improvement of similar size in scenarios 2 and 4 (VAS gains of 0.224 and 0.220, respectively) and elicited an almost equal mean WTP for both gains (€194 and €196, respectively). However, the duration of the gain in scenario 4 is three times longer than in scenario 2 (Table 5) and, therefore, the total (undiscounted) gain in scenario 4 is three times higher than in scenario 2. An increase in WTP of €2 seems, besides being statistically insignificant, practically meaningless (i.e., theoretically implausible), adding to the negative evidence on sensitivity to scale of WTP per QALY estimates.

12 We also investigated the ratio between WTP for the larger gains in scenario 3 and smaller gains in scenario 1 in all blocks for respondents with an income below the mean and median as well as those above. If budget constraints played a substantial role, the ratio would be expected to be considerably larger for respondents with a higher income, because their ability to express a ‘true’ WTP would be less constrained. Such a finding would indicate the influence of a budget constraint (Pinto-Prades et al., 2009). In our study, the ratio turned out to be 1.10 for low income respondents and 1.28 for high income respondents, a 16.6% difference in response variation. This was modest (especially considering that it was not corrected for other variables such as education or age), giving little reason to expect that budget constraints were a main driver of our results.


The sub-additivity bias (Hypothesis IV) was confirmed. Note that in all blocks, scenarios 1 to 3 used three distinct health states. If we label them, ordered from lowest- to highest-ranked health states A, B and C, scenario 1 valued the distance from A to B, scenario 2 from B to C and scenario 3 from A to C. Since we randomized the order of the different scenarios, the three health states were all valued twice (i.e., in each scenario in which they appeared). If the valuations of the health states had been equal in both valuations, the distance on the VAS from A to C in scenario 3 should equal the sum of the distance between A and B in scenario 1 and B and C in scenario 2. As we can see from Table 4, this was not the case. In scenario 3, the distance between A and C was smaller than the sum of A to B and B to C (in scenarios 1 and 2). (For example, 0.348 + 0.268 is more than 0.476 in blocks 1–4, Table 4.) The difference was not caused by different VAS valuations of the highest (C) and lowest (A) health states (which were valued almost identically both times), but by a difference in valuation of the middle health state (B) in scenarios 1 and 2 (with mean VAS scores of 0.42 and 0.55). This may be caused by a combination of ceiling effects and the wish to clearly differentiate between health states on the VAS scale.

Analyzing the WTP for the two smaller gains makes it clear that, even considering the fact that the VAS gains valued in scenarios 1 and 2 may exceed the VAS gain valued in scenario 3 (even though the two extreme health states were of identical value in the different scenarios), there is clear evidence for a sub-additivity bias. For instance, considering blocks 1–4 (Table 4), the WTP for the integral gain between the two extreme health states in scenario 3 (€174) is far lower than the sum of the valuations of the smaller gains in scenarios 1 and 2 (€150 and €170, respectively). In fact, only 28 respondents (2.5%) indicated neutral “scope” on one or more occasions such that the valuations of the parts actually added up to the valuation of the whole.
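The positive, negative and neutral “scope” categories described in footnote 10 amount to comparing the sum of the parts with the whole. The sketch below is a hypothetical helper, not the authors' code; it is applied here to the block-level means from Table 4 purely for illustration, whereas the paper assigned categories at the level of individual respondents.

```python
def scope_category(wtp_parts, wtp_whole):
    """Classify sub-additivity following the footnote 10 definitions.

    positive: sum of the parts exceeds the value of the whole
    negative: sum of the parts is lower than the value of the whole
    neutral:  sum of the parts equals the value of the whole
    """
    total = sum(wtp_parts)
    if total > wtp_whole:
        return "positive"
    if total < wtp_whole:
        return "negative"
    return "neutral"


# Blocks 1-4, mean monthly WTP in euros: scenarios 1 and 2 versus scenario 3.
category = scope_category([150, 170], 174)
excess = (150 + 170) - 174  # euros by which the parts exceed the whole
```

With these means the parts sum to €320 against €174 for the integral gain, which falls in the positive “scope” category.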

Multivariate regression models are presented in Table 6. The signs of coefficients aligned with a priori expectations. The size of coefficients (0.05 and 0.06 in models 1 and 2), however, indicated a clearly non-proportional relationship between WTP and the size of the health gain. The sign and the relationship between the main variables remained stable when introducing other variables in the regression (model 2, Table 6; Olsen et al., 2004; King et al., 2005). Such results emphasize that while the relationship between WTP and size of the health gain (both per year and in terms of duration) is in the expected direction (i.e., ‘theoretically valid’), the size of the coefficients raises important questions about the practical meaningfulness or ‘theoretical plausibility’ of the results.
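Because both WTP and the health gain enter the regression in logs, the coefficient on log(health gain) can be read as an elasticity. The sketch below shows what coefficients of 0.05 and 0.06 imply when the health gain doubles, holding other variables fixed; the function name `wtp_multiplier` is hypothetical and the calculation is a standard reading of a log-log model, not an analysis taken from the paper.

```python
def wtp_multiplier(gain_factor, elasticity):
    """Factor by which predicted WTP changes when the gain is multiplied
    by gain_factor in a log-log model with the given coefficient."""
    return gain_factor ** elasticity


# Doubling the health gain under the two reported coefficients.
increase_model_1 = (wtp_multiplier(2, 0.05) - 1) * 100  # percent
increase_model_2 = (wtp_multiplier(2, 0.06) - 1) * 100  # percent

# Strict proportionality corresponds to an elasticity of 1.
increase_proportional = (wtp_multiplier(2, 1.0) - 1) * 100

print(round(increase_model_1, 1), round(increase_model_2, 1),
      round(increase_proportional, 1))
```

Doubling the gain is thus associated with a predicted WTP increase of only a few percent, against 100% under proportionality.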

As expected, given the findings already presented, the WTP per QALY estimates varied considerably with the changes in the size of the gain, both when the quality of life and the duration varied. For example, the gain of around 0.34 received in three different scenarios a WTP per QALY of €8300, €13,600 and €12,000. Such values are comparable to others reported in the literature (e.g. King et al., 2005; Byrne et al., 2005; Gyrd-Hansen, 2003), while they seem to be on the lower end of values commonly proposed for use in decision-making (Bobinac et al., 2010). The gains of different duration received even inversely related estimates: gains of higher duration (3 or 5 years) received up to five times smaller WTP per QALY estimates than gains of shorter duration. For example, in block 1, WTP per QALY was €21,400 when the gain lasted 1 year but only €5900 when 3 years. On average, discount rates for health were 14% and 23% for the 5- and 3-year time span, respectively. However, while mitigating the differences somewhat, these discount rates cannot explain the observed differences in the estimates of WTP for the gains of such different duration or, consequently, the differences in WTP per QALY between different gain durations.
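A rough calculation shows why the reported discount rates do not close the gap. The sketch below computes how many discounted "year-equivalents" a 3- or 5-year gain is worth at the quoted rates. It rests on assumptions not stated in the paper: annual discounting, a constant gain per year, and an undiscounted first year. The function name `discounted_years` is hypothetical.

```python
def discounted_years(years, rate):
    """Sum of annual discount factors, with the first year undiscounted."""
    return sum(1 / (1 + rate) ** t for t in range(years))


three_year = discounted_years(3, 0.23)  # 3-year span at 23%
five_year = discounted_years(5, 0.14)   # 5-year span at 14%

# Block 1, Table 5: mean monthly WTP for the 3-year versus the 1-year gain.
wtp_ratio_block_1 = 196 / 194

print(round(three_year, 2), round(five_year, 2), round(wtp_ratio_block_1, 2))
```

Under these assumptions the discounted gain is still about 2.5 and 3.9 times the one-year gain, whereas mean WTP in block 1 rose by about 1%.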

Table 5. Results of the scale sensitivity test: scenarios with health gains of equal size and different duration. WTP per QALY estimates rounded to hundreds (all monetary values expressed in €).

| Block | Scenario | n | HS1 (tariff) | HS2 (tariff) | QALY gain (tariff) | Duration (years) | VAS (rescaled) HS1 | VAS (rescaled) HS2 | VAS QALY gain | WTP (month), mean (sd) | WTP (month), median | WTP per QALY, mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 113 | 0.109 | 0.478 | 0.369 | 1 | 0.29 | 0.514 | 0.224 | 194 (460) | 80 | 21,400 |
| 1 | 4 | | | | | 3 | 0.284 | 0.504 | 0.22 | 196 (447) | 70 | 5900 |
| 2 | 2 | 110 | 0.109 | 0.478 | 0.369 | 1 | 0.3 | 0.54 | 0.24 | 127 (157) | 75 | 10,400 |
| 2 | 4 | | | | | 5 | 0.311 | 0.53 | 0.219 | 169 (301) | 75 | 2900 |
| 3 | 1 | 110 | 0.478 | 0.847 | 0.369 | 1 | 0.41 | 0.751 | 0.341 | 168 (308) | 80 | 8100 |
| 3 | 4 | | | | | 3 | 0.41 | 0.734 | 0.324 | 199 (368) | 100 | 4600 |
| 4 | 1 | 111 | 0.478 | 0.847 | 0.369 | 1 | 0.424 | 0.71 | 0.286 | 92 (90) | 75 | 5500 |
| 4 | 4 | | | | | 5 | 0.418 | 0.734 | 0.316 | 107 (116) | 70 | 1200 |
| 5 | 2 | 109 | 0.395 | 0.556 | 0.161 | 1 | 0.28 | 0.55 | 0.27 | 159 (206) | 100 | 11,900 |
| 5 | 4 | | | | | 3 | 0.29 | 0.546 | 0.256 | 158 (169) | 120 | 5700 |
| 6 | 2 | 110 | 0.395 | 0.556 | 0.161 | 1 | 0.24 | 0.59 | 0.35 | 182 (360) | 100 | 13,900 |
| 6 | 4 | | | | | 5 | 0.255 | 0.552 | 0.297 | 211 (435) | 100 | 3300 |
| 7 | 1 | 110 | 0.556 | 0.897 | 0.341 | 1 | 0.393 | 0.78 | 0.387 | 176 (350) | 100 | 13,900 |
| 7 | 4 | | | | | 3 | 0.44 | 0.8 | 0.36 | 146 (260) | 100 | 4500 |
| 8 | 1 | 108 | 0.514 | 0.897 | 0.383 | 1 | 0.47 | 0.75 | 0.28 | 121 (210) | 58 | 8200 |
| 8 | 4 | | | | | 3 | 0.48 | 0.75 | 0.27 | 129 (181) | 75 | 2900 |
| 9 | 2 | 109 | 0.185 | 0.514 | 0.329 | 1 | 0.32 | 0.59 | 0.27 | 202 (374) | 100 | 14,200 |
| 9 | 4 | | | | | 3 | 0.32 | 0.56 | 0.24 | 216 (374) | 120 | 6300 |
| 10 | 2 | 101 | 0.185 | 0.514 | 0.329 | 1 | 0.315 | 0.55 | 0.235 | 223 (400) | 105 | 17,900 |
| 10 | 4 | | | | | 5 | 0.296 | 0.56 | 0.264 | 249 (432) | 101 | 2400 |

Median values were (predominately) independent of the size of the gain (e.g. Norinder et al., 2001). The variation of the means around the medians suggests a high variability in results and, indirectly, non-normality of the distributions. Since each respondent solved multiple tasks, all tests were repeated to account for clustering, i.e., multiple valuations from the same subject tend to be positively correlated due to the common subject-specific characteristics such as age, income, and cultural factors. Clustered t-tests showed that this did not significantly impact the results. We also tested for learning effects, but found no indication that respondents became more sensitive to changes in the size of the gain in consecutive exercises. Finally, we found no evidence of an ordering bias in our study.

Table 6. Multivariate clustered regression analysis with “raw” WTP estimates as the dependent variable.

| DV: log(WTP) | Model 1 Coef | Model 1 R st | Model 1 p | Model 2 Coef | Model 2 R st | Model 2 p |
|---|---|---|---|---|---|---|
| Log(health gain) | 0.05 | 0.03 | 0.07 | 0.06 | 0.03 | 0.02 |
| Log(duration) | 0.03 | 0.02 | 0.04 | 0.04 | 0.02 | 0.02 |
| Income groups | | | | | | |
|   Group 1 (≤999 €) | | | | – | – | – |
|   Group 2 (>999 and <2000 €) | | | | 0.37 | 0.12 | 0.00 |
|   Group 3 (>1999 and <3500 €) | | | | 0.72 | 0.12 | 0.00 |
|   Group 4 (≥3500 €) | | | | 1.34 | 0.16 | 0.00 |
| Number of people living on household income | | | | −0.07 | 0.03 | 0.01 |
| Log(age) | | | | −0.34 | 0.11 | 0.00 |
| Higher education | | | | 0.26 | 0.07 | 0.00 |
| Gender (1 = female) | | | | 0.14 | 0.07 | 0.04 |
| Intercept | 4.52 | 0.05 | 0.00 | 5.22 | 0.4 | 0.00 |
| n | 4018 | | | 4018 | | |
| R2 | 0.01 | | | 0.12 | | |

Note. DV, dependent variable; “R st”, robust standard error; p, P > |t|.

4. Discussion

Depending on the normative framework and decision rules adopted (Claxton et al., 2011), the monetary value of a QALY can be seen as representing the appropriate cost-effectiveness threshold or, if not directly informative in that context, at least provide an opportunity for public discourse on health care limits and decisions (Gyrd-Hansen, 2005; Weinstein, 2008). However, before estimates can be considered useful for decision-making or even public debate, evidence regarding their validity must be provided. Here, we have empirically explored the construct validity of individual WTP per QALY estimates using a large-scale study (Bobinac et al., 2010) designed to obtain monetary values for health gains and to explicitly test several aspects of validity. However, our results relate to only one study using a specific design and methods to obtain both WTP estimates and QALY gain estimates. Although these methods are not uncommon in this stream of literature and the results thus seem relevant to the general discussion on the validity of WTP per QALY estimates, applying different methods may lead to different results (e.g. using VAS, OE and PS, or the choice of the particular 29 scenarios). The level of insensitivity, however, seems hardly explainable only by the choice of the methods. This is emphasized by the fact that WTP per QALY estimates are reasonably similar to earlier studies in this area (Bobinac et al., 2010).

Overall, our results lend relatively little support to the validity of WTP per QALY estimates obtained this way. The relationship between WTP and QALYs was clearly not ‘nearly proportional’ (NOAA, 1993; Hammitt and Graham, 1999). In fact, only

sporadic statistically significant differences between WTP estimates for smaller and larger health gains were found. Complete insensitivity in terms of duration is especially worrisome since it appears less cognitively demanding to exhibit sensitivity in case of unidimensional differences (e.g. 1 vs. 5 years) between scenarios than in the case of multidimensional health gains. This clear lack of sensitivity, even in the context of a questionnaire that enabled respondents to indicate their WTP while seeing the QALY gain under valuation on the screen, as a reminder of its size, is worrisome. The insensitivity of “raw” WTP is directly reflected in WTP per QALY estimates and their sensitivity, and it is worrisome that WTP per QALY estimates depend on the size of the gain on offer.

Our results largely confirm the existence of the sub-additivity bias. While not unexpected (although relatively new in this specific context), the magnitude of the differences between the sum of two smaller gains and the integral gain was indeed considerable and raises clear questions on how to infer an accurate value for a health gain.

Larger scale sensitivity, in terms of the statistical difference between WTP estimates for smaller and larger gains of equal duration, was found in the subgroup of respondents with a higher level of certainty in their WTP values. This indicates the importance of subgroup analyses because sample-average sensitivity tests might lead to over-restrictive conclusions regarding validity (e.g. Heberlein et al., 2005). However, the extent of the scale sensitivity only marginally improved in the subgroup. If, alternatively, a much larger sensitivity was found among more certain respondents, it would support the idea that only WTP estimates obtained in that subgroup are practically useful since these (hypothetical) estimates were found to approach the revealed WTP more closely (e.g. Johannesson et al., 1999; Blumenschein et al., 2001).

Budget constraints do not appear to have caused the insensitivity (even though the health gains on offer were not marginal). The observed increases in WTP for larger health gains were very small compared to the increase in health gains, raising obvious questions regarding the practical meaningfulness or theoretical plausibility of the estimates. This is compounded by the fact that our results were in line with previous studies reporting a non-linear relationship between WTP and QALY gains (even for changes that might be considered marginal; Smith, 2005). Some studies have reported an overall insensitivity to scale (e.g. Olsen et al., 2004); in some, WTP even appears to decrease as the health benefit increases (Pinto-Prades et al., 2009).

The lack of validity observed in our study could stem from several sources, among which are design-related problems (despite careful consideration of various aspects of study design and pilot-testing) and problems in the properties of and relationship between WTP and QALY. With respect to the first problem, there are different ways in which the estimates of the WTP per QALY can be obtained and it is useful to investigate whether some (design) approaches perform better, in terms of validity, than others. It might be hypothesized that replacing the VAS with either TTO or SG, changing particular features of the CV or choosing different scenarios may have improved the validity. For example, the EuroQoL-5D description system might not have provided enough information for respondents to fully appreciate the severity of health states in a web-based questionnaire. Thus, respondents’ answers may have been less thoughtful and more heuristic in nature than would otherwise be the case. Since our study was not repeated, the quality (or reliability) of the contingent market could not be tested further. Although “non-realism” is a notable problem in CV exercises, we avoided it as much as possible through ex post and ex ante mitigation. A partial solution to invalidity caused by such procedural problems would be to strongly emphasize the differences in outcomes (Arrow et al., 1993; Corso et al., 2001). Therefore, in this study the gain on the VAS scale achieved with a certain payment was shown and alluded to throughout the WTP exercise.13 If knowing and thinking more about the good in question leads to more valid results then researchers might also consider using a more health literate population in future WTP studies. This raises an important, more general dilemma regarding the validity of WTP estimates. It may well be that the validity of WTP responses may be better in some subgroups than in a representative sample of the general population. The question then is, would it be appropriate to focus on sub-groups (losing representativeness) or a representative sample (losing validity)?

Importantly, validity was investigated only in the context of the current study and its specific design. It would be relevant to compare the outcome of validity tests in empirical studies that also estimated the WTP per QALY but chose other designs (e.g. Olsen and Donaldson, 1998; Gyrd-Hansen, 2003). However, available empirical studies scarcely report in-depth validity checks, with the exception of Pinto-Prades et al. (2009) who reported results quite similar to ours, if not “worse”. In particular, WTP per QALY in that study varied inversely with the magnitude of health gains and there was evidence of ordering effects and insensitivity of WTP to the duration of the period of payment. Pinto-Prades and colleagues (2009) employed a Standard Gamble (SG) and a card sorting procedure to obtain QALY weights and WTP estimates, respectively, which is different than the procedure employed here. However, this approach did not lead to an increase in the level of validity, lending additional support to the suggestion that problems observed here go beyond design-related issues, including using more sophisticated methods such as SG to obtain health state valuations rather than the VAS used here.

The problems in the properties of and the relationship between WTP and QALY, as estimated here, raise three noteworthy issues. First, we used the VAS as a means of health state valuation. Not only has the relevance of VAS valuations sometimes been disputed (e.g. Parkin and Devlin, 2006), here it may have resulted in some noteworthy response patterns. Due to randomization of the scenarios (causing some respondents to start with scenario 3 rather than 1 or 2), all respondents had to value the three health states twice. As mentioned, the VAS valuation of the two extreme health states (the lowest one in scenarios 1 and 3 and the highest one in scenarios 2 and 3) was nearly identical at the two valuation moments. However, the middle health state was valued clearly differently between the two moments. Perhaps respondents tried to clearly differentiate between the health states on the VAS while simultaneously being influenced by ceiling effects of the VAS. The latter implies that shifting the extreme health states further downward or upward was not an option, so that the placement of the middle state had to be shifted between scenarios 1 and 2. This implies that our sub-additivity testing was imperfect in terms of indicated utility distance (although correct in terms of the described health gain). Although important to note, the result does not seem to have caused the observed insensitivity. It may, however, have contributed to the observed sub-additivity bias, as the largest VAS gain (in scenario 3) was lower than the sum of the two smaller VAS gains (in scenarios 1 and 2). It may also have influenced the relative valuations of QALY gains high and low on the scale.

13 Providing respondents with an opportunity to form more stable preferences by restating them in repeated interviews is another option. Heberlein et al. (2005) and MacMillan et al. (2006) are among many who have found that WTP changed significantly when respondents were given additional information and time to think about an unfamiliar environmental good, thus better forming their preferences. This seems to be an interesting line for future research.


Second, generally, WTP may measure broader outcomes thanhe QALY does, and therefore, a (nearly) proportional relationshipetween the two quantities need not exist. Given that there maye more associated with improved health than health-related util-

ty (e.g. income changes), the relationship between WTP and QALYains may seem to behave unexpectedly. It seems unlikely, how-ver, that such considerations could explain the results presentedere in a convincing manner. It is hard to see for instance how itould explain that an additional 0.2 QALY gain would yield a D 20

ncrease in monthly WTP in a sample of the Dutch population. Inact, it would rather be expected to induce an increased sensitivityf WTP per QALY estimates for (some ranges of) health gains.

Third, generally speaking, there is a difference between valua-ions focusing on health states (after which the size of a health gains calculated) and WTP valuations, which directly value a healthhange. In contrast to a health state valuation exercise, respon-ents in any WTP exercise have the opportunity to consider bothhe origin and destination of the gain, potentially influencing theiraluation (Weinstein et al., 2009). It seems that using differentrocedures to value changes thus may hamper the comparisonf methods and influence the observed validity of WTP per QALYstimates. This issue may be related to recent debates regardinghe optimal way to value health improvements (Nord et al., 2008;rummond et al., 2009). Moreover, differences between valuationethods could arise if the properties underlying the conventionalALY model do not hold. It has been suggested, therefore, that

esearchers consider non-linear specifications of the QALY modelo see if it adds to validity of findings (Pinto-Prades et al., 2009).lthough clearly important, our results seem not easily explainedolely by violations of the QALY model.

Given the evidence regarding problems with the CV method in general and in the context of health gains specifically, the question of whether our results indeed do not reflect inherent problems with the method seems justified. Hammitt and Graham (1999), considering WTP for risk reductions, note “. . .additional research on improving the application of CV to health risk is warranted. We test several variations in CV instrument design, but obtain only modest improvements in sensitivity to probability” (p. 34). In that sense, comparing the validity of different methods to derive monetary values for a QALY (WTP, DCE, etc.) may be an interesting issue for further research (Tilling et al., 2009). Our study has at least emphasized that caution is required in directly considering outcomes of WTP per QALY estimates – including those published in Bobinac et al. (2010) – when studies fail to provide convincing and extensive evidence on the validity of such estimates. While we do not wish to imply that practically meaningful WTP estimates cannot be established, in spite of the results presented here or those presented in Pinto-Prades et al. (2009), our findings provide an indication of the type and extent of problems in similar studies and the depth of inquiry required to make claims about the validity of results. Theoretical validity is a relatively undemanding requirement, but insufficient to demonstrate construct validity. If sound estimates of WTP for health gains are sought, it is pivotal to understand the insensitivity of the estimates as reported here, and, if possible, to unravel (and ideally counter) its causes. Whether this is possible has, in fact, been doubted (e.g. Kahneman et al., 1999). Further methodological analysis and testing thus seems necessary to investigate whether and how the CV method can meaningfully inform health care decision making (Klose, 1999). Possibly, validity testing should become an integral part of piloting a CV study since at that stage there is still room for improvement.

For now, the theoretical validity and especially the practical meaningfulness or theoretical plausibility of WTP estimates for health gains in general appear to be insufficiently demonstrated in order to consider current estimates useful for informing policymaking or public debate. Future studies need to convince readers not only of the theoretical validity of the estimates of the value per QALY, but also of their construct validity, that is, their theoretical plausibility. In that sense, it is critical to pay more attention to such aspects in future studies in order to get more valid results: pay more, get more, therefore.

Acknowledgements

This study is part of a larger project investigating the broader societal benefits of health care, which was financially supported by Astra-Zeneca, GlaxoSmithKline, Janssen-Cilag, Merck and Pfizer BV. The researchers were free in study design, data collection, analysis and interpretation, as well as writing and submitting the manuscript for publication. The views expressed in this paper are those of the authors.

References

Arrow, K., Solow, R., Leamer, E., Radner, R., Schuman, H., 1993. Report of the NOAA Panel on contingent valuation. Federal Register 58, 4601–4614.

Baker, R., Currie, G.R., Donaldson, C., 2010. What needs to be done in contingent valuation: have Smith and Sach missed the boat? Health Economics, Policy and Law 5, 113–121.

Bateman, I.J., Brouwer, R., 2006. Consistency and construction in stated WTP for health risk reductions. A novel scope sensitivity test. Resource and Energy Economics 28, 199–214.

Bateman, I.J., Jones, A.P., 2003. Contrasting conventional with multi-level modeling approaches to meta-analysis: expectation consistency in U.K. woodland recreation values. Land Economics 79, 235–258.

Bhatia, M.R., Fox-Rushby, J.A., 2003. Validity of willingness to pay: hypothetical versus actual payment. Applied Economics Letters 10, 737–740.

Blomquist, G., Blumenschein, K., Johannesson, M., 2009. Eliciting willingness to pay without bias using follow-up certainty statements: comparisons between probably/definitely and a 10-point certainty scale. Environmental and Resource Economics 43, 473–502.

Blumenschein, K., Johannesson, M., Yokoyama, K.K., Freeman, P.R., 2001. Hypothetical versus real willingness to pay in the health care sector: results from a field experiment. Journal of Health Economics 20, 441–457.

Bobinac, A., van Exel, N.J.A., Brouwer, W.B.F., Rutten, F., 2010. Willingness to pay for a QALY: the individual perspective. Value in Health 13, 1046–1055.

Bradford, D.F., 1972. Benefit-cost analysis and demand curves for public goods. Kyklos 23, 775–791.

Byrne, M.M., O'Malley, K., Suarez-Almazor, M.E., 2005. Willingness to pay per quality-adjusted life year in a study of knee osteoarthritis. Medical Decision Making 25, 655–666.

Cameron, A.T., Quiggin, J., 1994. Estimation using contingent valuation data from a dichotomous choice with follow-up questionnaire. Journal of Environmental Economics and Management 27, 218–234.

Carson, R., Mitchell, R., 1995. Sequencing and nesting in contingent valuation surveys. Journal of Environmental Economics and Management 28, 155–173.

CBS. http://www.cbs.nl/en-GB/menu/cijfers/kerncijfers/default.htm (accessed 21.03.09).

Claxton, K., Paulden, M., Gravelle, H., Brouwer, W.B.F., Culyer, A.J., 2011. Discounting and decision making in the economic evaluation of health care technologies. Health Economics 20, 2–15.

Corso, P., Hammitt, J., Graham, J., 2001. Valuing mortality-risk reduction: using visual aids to improve the validity of contingent valuation. Journal of Risk and Uncertainty 23, 165–184.

Desvousges, W.H., Johnson, F.R., Dunford, R.W., Boyle, K.J., Hudson, S.P., Wilson, K.N., 1993. Measuring natural resource damages with contingent valuation: tests of validity and reliability. In: Hausman, J.A. (Ed.), Contingent Valuation: A Critical Assessment. Amsterdam, North-Holland, pp. 91–159.

Donaldson, C., Shackley, P., Abdalla, M., Miedzybrodzka, Z., 1995. Willingness to pay for antenatal carrier screening for cystic fibrosis. Health Economics 4, 439–452.

Donaldson, C., Shackley, P., Abdalla, M., 1997. Using willingness to pay to value close substitutes: carrier screening for cystic fibrosis revisited. Health Economics 6, 145–159.

Donaldson, C., Baker, R., Mason, H., Jones-Lee, M., Lancsar, E., Wildman, J., Bateman, I., Loomes, G., Robinson, A., Sugden, R., Pinto Prades, J.L., Ryan, M., Shackley, P., Smith, R., 2011. The social value of a QALY: raising the bar or barring the raise? BMC Health Services Research 11, 8.

Drummond, M., Brixner, D., Gold, M., Kind, P., McGuire, A., Nord, E., Consensus Development Group, 2009. Toward a consensus on the QALY. Value in Health 12, 31–35.

Dubourg, W.R., Jones-Lee, M.W., Loomes, G., 1997. Imprecise preferences and survey design in contingent valuation. Economica 64, 681–702.

The EuroQol Group, 1999. EuroQol – a new facility for the measurement of health related QoL. Health Policy 16, 199–208.

Fisher, A.C., 1996. The conceptual underpinnings of the contingent valuation method. In: Bjornstad, D.J., Kahn, J.R. (Eds.), The Contingent Valuation of Environmental Resources: Methodological Issues and Research Needs. Edward Elgar, Cheltenham, pp. 19–37.

Flores, N., Carson, R., 1997. The relationship between the income elasticities of demand and willingness to pay. Journal of Environmental Economics and Management 33, 287–295.

Frederick, S., Fischhoff, B., 1998. Scope insensitivity in elicited valuations. Risk, Decision and Policy 3, 109–123.

Gyrd-Hansen, D., 2003. Willingness to pay for a QALY. Health Economics 12, 1049–1060.

Gyrd-Hansen, D., 2005. Willingness to pay for a QALY: theoretical and methodological issues. Pharmacoeconomics 23, 423–432.

Hammitt, J.K., Graham, J.D., 1999. Willingness to pay for health protection: inadequate sensitivity to probability? Journal of Risk and Uncertainty 18, 32–62.

Heberlein, T.A., Wilson, M.A., Bishop, R.C., Schaeffer, N.C., 2005. Rethinking the scope test as a criterion for validity in contingent valuation. Journal of Environmental Economics and Management 50, 1–22.

Jakobsson, K., Dragun, J., 1996. Contingent Valuation and Endangered Species. Edward Elgar, Cheltenham, UK and Brookfield, US.

Johannesson, M., Blomquist, G.C., Blumenschein, K., Johansson, P., Liljas, B., O’Conor, R.M., 1999. Calibrating hypothetical willingness to pay responses. Journal of Risk and Uncertainty 18, 21–32.

Johnson, R., Banzhaf, M.R., Desvousges, W.H., 2000. Willingness to pay for improved respiratory and cardiovascular health: a multiple-format, stated-preference approach. Health Economics 9, 295–317.

Jones-Lee, M., Loomes, G., Philips, P., 1995. Valuing the prevention of non-fatal road injuries: contingent valuation vs. standard gambles. Oxford Economic Papers 47, 676–695.

Jorgensen, B.S., Syme, G.J., Smith, L.M., Bishop, B.J., 2004. Random error in willingness to pay measurement: a multiple indicators, latent variable approach to the reliability of contingent values. Journal of Economic Psychology 25, 41–59.

Kahneman, D., Ritov, I., Schkade, D., 1999. Economic preferences or attitude expressions? An analysis of dollar responses to public issues. Journal of Risk and Uncertainty 19, 203–235.

Kind, P., Dolan, P., Gudex, C., Williams, A., 1998. Variations in population health status: results from a United Kingdom national questionnaire survey. BMJ 316, 736–741.

King Jr., J.T., Tsevat, J., Lave, J.R., Roberts, M.S., 2005. Willingness to pay for a Quality-Adjusted Life year: implications for societal health care resource allocation. Medical Decision Making 25, 667–677.

Klose, T., 1999. The contingent valuation method in health care. Health Policy 47, 97–123.

Lamers, L.M., McDonnell, J., Stalmeier, P.F., 2006. The Dutch tariff: results and arguments for an effective design for national EQ-5D valuation studies. Health Economics 15, 1121–1132.

Lienhoop, N., MacMillan, D., 2007. Valuing wilderness in Iceland: estimation of WTA and WTP using the market stall approach to contingent valuation. Land Use Policy 24, 289–295.

MacMillan, D., Hanley, N., Lienhoop, N., 2006. Contingent valuation: environmental polling or preference engine. Ecological Economics 60, 299–307.

McFadden, D., Leonard, G., 1993. Issues in the contingent valuation of environmental goods: methodologies for data collection and analysis. In: Hausman, J.A. (Ed.), Contingent Valuation: A Critical Assessment. Amsterdam, North-Holland, pp. 91–159.

NOAA Panel. http://www.cbe.csueastbay.edu/∼alima/courses/4306/articles/NOAA%%20on%%20contingent%%20valuation%%201993.pdf (accessed 14.04.10).

Nord, E., Daniels, N., Kamlet, M., 2008. QALYs: some challenges. Value Health 12 (Suppl. 1), 10–15.

Norinder, A., Krister, H., Ulf, P., 2001. Scope and scale insensitivities in a contingent valuation study of risk reductions. Health Policy 57, 141–153.

Olsen, J.A., Donaldson, C., 1998. Helicopters, hearts and hips: using willingness to pay to set priorities for public sector health care programmes. Social Science and Medicine 46, 1–12.

Olsen, J.A., Donaldson, C., Pereira, J., 2004. The insensitivity of ‘willingness-to-pay’ to the size of the good: new evidence for health care. Journal of Economic Psychology 25, 445–460.

Parkin, D., Devlin, N., 2006. Is there a case for using visual analogue scale valuations in cost-utility analysis? Health Economics 15, 653–664.

Pinto-Prades, J.L., Loomes, G., Brey, R., 2009. Trying to estimate a monetary value for the QALY. Journal of Health Economics 28, 553–562.

Ryan, M., 2004. A comparison of stated preference methods for estimating monetary values. Health Economics 13, 291–296.

Shiroiwa, T., Sung, Y., Fukuda, T., Lang, H., Bae, S., Tsutani, K., 2010. International survey on willingness-to-pay (WTP) for one additional QALY gained: what is the threshold of cost effectiveness? Health Economics 4, 422–437.

Smith, R.D., 2001. The relative sensitivity of willingness-to-pay and time-trade-off to changes in health status: an empirical investigation. Health Economics 10, 487–497.

Smith, R.D., 2005. Sensitivity to scale in contingent valuation: the importance of the budget constraint. Journal of Health Economics 24, 519–529.

Smith, R.D., 2006. It’s not just what you do, it’s the way that you do it: the effect of different payment card formats and survey administration on willingness to pay for health gain. Health Economics 15, 281–293.

Smith, R.D., Sach, T.H., 2010. Contingent valuation: what needs to be done? Health Economics, Policy and Law 5, 91–111.

Tilling, C., Krol, M., Tsuchiya, A., Brazier, J., van Exel, J., Brouwer, W., 2009. Measuring the value of life: exploring a new method for deriving the monetary value of a QALY. University of Sheffield Working Paper 09-14, Sheffield University.

Van Exel, N.J.A., Brouwer, W.B.F., van den Berg, B., Koopmanschap, M.A., 2006. With a little help from an anchor: discussion and evidence of anchoring effects in contingent valuation. Journal of Socio-Economics 35, 836–853.

Van Houtven, G., Powers, J., Jessup, A., Yang, J., 2006. Valuing avoided morbidity using meta-regression analysis: what can health status measures and QALYs tell us about WTP? Health Economics 15, 775–795.

Weinstein, M., 2008. How much are Americans willing to pay for a Quality-Adjusted Life Year? Medical Care 46, 343–345.

Weinstein, M., McGuire, A., Torrance, G., 2009. QALYs: the basics. Value Health 12 (Suppl. 1), S5–S9.

Yeung, R., Smith, R.D., McGhee, S.M., 2003. Willingness to pay and size of health benefit: an integrated model to test for ‘sensitivity to scale’. Health Economics Letters 12, 791–796.