
Automatic learning of adaptive treatment strategies

using kernel-based reinforcement learning

Presented by: Joelle Pineau, School of Computer Science, McGill University

Joint work with: Marc G. Bellemare, Adrian Ghizaru, Zaid Zawaideh, Susan Murphy (Univ. Michigan), John Rush (UT Southwestern Medical Center)

Biostatistics Seminar, McGill University, March 7, 2007


Outline

• Background on adaptive treatment design

• Reinforcement learning primer

• Automatic construction of treatment strategies

• Preliminary results from the STAR*D trial

• Discussion


Adaptive treatment strategies

• Individually tailored treatments, where treatment type, dosage, and duration vary according to patient outcomes.

• Goal is to improve longer-term outcomes for patients with chronic disorders, as opposed to focusing only on short-term benefits.

• Recently developed for treatment of chronic illnesses, e.g.
  » Brooner et al. (2002) Treatment of Cocaine Addiction
  » Breslin et al. (1999) Treatment of Alcohol Addiction
  » Prochaska et al. (2001) Treatment of Tobacco Addiction
  » Rush et al. (2003) Treatment of Depression


Example of an adaptive treatment strategy

Following graduation from an intensive outpatient program, alcohol-dependent patients are provided naltrexone (Treatment 1). In the ensuing two months, patients are monitored. If during that time the patient has 5 or more heavy drinking days (Nonresponder), then the medication is switched to acamprosate (Treatment 2a). If the patient reports no more than 4 heavy drinking days (Responder), then the patient is continued on naltrexone and monitored monthly for signs of relapse (Treatment 2b).


Why use adaptive treatment strategies?

• In Prevention

– Individuals at risk of problem behaviors for different reasons (multiple causes).

– Periods of high and low risk.

• In Treatment
  – High heterogeneity in response to any one treatment.

» What works for one person may not work for another.

– Improvement often marred by relapse.
  » What works now for a person may not work later.

– Co-occurring disorders may be common.


The big questions

• What is the best sequencing of treatments?

• What information do we use to make these decisions?


SMART Trials

• SMART = Sequential Multiple Assignment Randomized Trial [Murphy, 2005]

• These are multi-stage trials; conceptually a randomization takes place at each stage.

• Goal is to inform the construction of an adaptive treatment strategy.


Example of a SMART study

Legend:

MedA = Naltrexone

MedB = Acamprosate

CBT = Cognitive Behavioral Therapy

TDM = Telephone Disease Management

EM = Motivation therapy for adherence


Question

• Why not use data from multiple trials to construct the dynamic treatment regime?

Choose the best initial treatment on the basis of a randomized trial of initial treatments.

Choose the best secondary treatment on the basis of a randomized trial of secondary treatments.

Etc.


Delayed effects

Negative synergies:

Initial treatment may produce a higher proportion of responders, but also result in side effects severe enough that nonresponders are less likely to adhere to subsequent treatments.

Positive synergies:

A treatment may not be best initially but may have enhanced long-term effectiveness when followed by a particular maintenance treatment.

Diagnostic effects:

Initial treatment may elicit an informative patient response that permits a better choice of subsequent treatment.


Examples of SMART designs:

• Thall et al. (2000) Treatment of Prostate Cancer

• CATIE (2001) Treatment of Psychosis in Schizophrenia

• STAR*D (2003) Treatment of Depression

• Oslin (on-going) Treatment of Alcohol Dependence


STAR*D

• STAR*D = Sequenced Treatment Alternatives to Relieve Depression
  – Largest study of depression (14 regional centers, 4041 patients).
  – Longitudinal (5 years), NIMH-sponsored.
  – Multiple patient randomizations (up to 4 treatments: medication or psychotherapy).

[Timeline: Entrance evaluation → First treatment → First evaluation → … → Fourth treatment → Fourth evaluation]


STAR*D Randomizations (slightly simplified!)

Level 1: Treatment: CIT

Level 2: Treatment options (strategy: Switch or Augment):
  Switch: SER, BUP, VEN, CT
  Augment: CIT+BUP, CIT+BUS, CIT+CT

Level 3: Treatment options (strategy: Switch or Augment):
  Switch: MRT, NTP
  Augment: SER+Li, BUP+Li, VEN+Li, CIT+Li

Level 4: Treatment options (strategy: Switch):
  TCP, VEN+MRT


Clinical data

• Each evaluation measures 20+ variables, e.g.
  – Quick Inventory of Depressive Symptoms (QIDS)
  – Frequency and intensity of side-effects
  – Work and productive activity impairment
  – Quality of life enjoyment and satisfaction
  – Medication compliance

• Research outcomes measured (during clinic visit or telephone interview) at:
  – beginning of treatment level,
  – every ~6 weeks during treatment,
  – exit from treatment level,
  – every month during follow-up.

• Remission = QIDS at end of treatment level falls at or below threshold (QIDS ≤ 5).


Research objectives for computer scientists

• Long-term:

– Develop new methodologies for automatic learning of adaptive treatment strategies.

• Current project:

– Automatically construct decision rules from the STAR*D trial dataset using reinforcement learning techniques.


Decision rules

• Adaptive treatment strategies are composed of a series of decision rules, one per treatment step.

• Decision rule:
  IF baseline assessments = Z0
  AND prior treatments at Steps 1 through t = {A1, …, At}
  AND clinical assessments at Steps 1 through t = {Z1, …, Zt}
  THEN at Step t+1 apply treatment At+1

Goal is to learn such rules directly from data.
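
To make this concrete, such a rule can be written as a small function of the observed history. The sketch below is illustrative only: the treatment names come from the slides, but the threshold and branching logic are hypothetical, not from any trial protocol.

```python
# A toy decision rule in the IF/THEN form above.

def decision_rule(z0, prior_treatments, prior_assessments):
    """Return the treatment to apply at Step t+1 given the history.
    z0 (baseline assessment) is available to the rule but unused
    in this toy example."""
    latest_qids = prior_assessments[-1]
    if prior_treatments[-1] == "CIT" and latest_qids > 5:
        return "SER"                 # no remission under CIT: switch
    return prior_treatments[-1]      # doing well: stay the course

# Usage: baseline QIDS z0 = 20, one prior step of CIT with exit QIDS 14.
print(decision_rule(z0=20, prior_treatments=["CIT"], prior_assessments=[14]))
```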


An adaptive treatment strategy in STAR*D

Patient is diagnosed with depression. Treat with citalopram.

If the patient’s depression remits and medication is tolerated, then retain the patient on citalopram.

If the patient does not remit, or cannot tolerate citalopram, then ascertain the patient’s preference for a medication switch or treatment augmentation.

If the patient prefers a switch in treatment then provide sertraline.

If the patient prefers an augmentation and the patient was poorly adherent to citalopram, then provide cognitive therapy in addition to citalopram.

If the patient prefers an augmentation and the patient was adherent to citalopram, then provide bupropion in addition to citalopram.


Learning decision rules

• Specify set of candidate decision rules.

• For each decision rule, evaluate:

– its expected reward (or cost),

– its expected effect.

• Select the sequence of decision rules with the highest expected cumulative reward (or lowest cost).

• Formal framework for such problems: reinforcement learning.


What is reinforcement learning (RL)?

• Inspired by trial-and-error learning studied in the psychology of animal learning:

– good actions are positively reinforced,

– poor actions are negatively reinforced.

• Formalized in computer science and operations research [Bellman, 1957; Sutton & Barto, 1998].


What is reinforcement learning (RL)? (cont’d)

• Mathematical framework to estimate the usefulness of taking sequences of actions in an evolving, time-varying system.

Intuition: A decision-maker tries (random) actions and observes their effects until it gradually learns when to apply which action.

• Used to evaluate treatment based on immediate and long-term effects.

• Can evaluate both therapeutic and diagnostic effects.


A reinforcement learning problem is described by:

States: What we know
  – History of measurements and medications

Actions: What we can do
  – Treatment choice: CIT, CIT+CT, BUP, psychotherapy, …

Effects: How the action affects the state
  – E.g. QIDS(t=1) = 20 + A = CIT → QIDS(t=2) = 15

Rewards: Time-dependent outcomes
  – Duration of treatment, final outcome (remission, no remission, left trial), side-effect burden, …
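
One plausible way to organize trial data for such an analysis (an assumption of this sketch, not a STAR*D specification) is as per-step transition records <s, a, s', r>:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Transition:
    """One treatment step from a patient trajectory: <s, a, s'> plus reward."""
    state: List[float]       # measurements and medication history so far
    action: str              # treatment chosen at this step, e.g. "CIT"
    next_state: List[float]  # measurements after this treatment level
    reward: float            # time-dependent outcome, e.g. remission indicator

# Toy example: QIDS drops from 20 to 15 under CIT; no remission yet, so reward 0.
step1 = Transition(state=[20.0], action="CIT", next_state=[15.0], reward=0.0)
```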


A graphical model of the STAR*D trial

[Figure: graphical model of the trial, where s = state, a = action, e = effect, r = reward.
s0 = pretreatment data (demographics, screening assessments, baseline assessments, baseline IVRs);
action a1 = CIT leads to s1 = Level 1 data (treatment preference, assigned treatment, clinical assessments, IVR assessments, duration);
action a2 = BUP leads to s2 = Level 2 data (same variables); and so on, with effects e and rewards r0, r1, r2 attached to each step.]


Reinforcement learning objective

• To learn the best treatment strategy for each state.

“best” = the one that maximizes a given reward function.

• The reward function r(s) is a numerical indicator of whether a state is good or bad, capturing time-varying outcomes, e.g.:

– Depression score (e.g. QIDS) [Lower score = higher reward]

– Severity of side-effects [Lower score = higher reward]

– Length of treatment [Shorter length = higher reward]


Reinforcement learning in STAR*D

Q: Which treatment a2 has the highest total (immediate + future) reward?

[Figure: the same graphical model as before, with a1 = CIT fixed and the Level 2 action a2 = ?? to be chosen.]


Maximizing the reward function

• To maximize the expected total reward: E[ r(s0) + r(s1) + r(s2) + r(s3) + r(s4) ]

• We define the value function:
  Vt(s) = max_a ∑_s' P^a_ss' [ r(s,a,s') + Vt+1(s') ]
  where P^a_ss' captures the effect of treatment on state.

• If we know P^a_ss' and r(s,a,s'), we can compute the value function by dynamic programming.

• Problem: we lack a model to estimate P^a_ss'!
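
For intuition, here is a toy finite-horizon dynamic-programming sketch in which P^a_ss' and the rewards are known and invented purely for illustration; as the last bullet notes, trial data gives us no such model, which is what motivates the kernel-based approach on the next slide.

```python
import numpy as np

# Toy MDP: 3 states, 2 actions, horizon 4. The transition model P[a][s][s']
# and reward R[s'] are made up for illustration.
n_states, n_actions, horizon = 3, 2, 4
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = np.array([0.0, 0.5, 1.0])    # e.g. state 2 = remission

V = np.zeros(n_states)           # boundary condition: V_horizon(s) = 0
for t in reversed(range(horizon)):
    # Bellman backup: V_t(s) = max_a sum_s' P^a_ss' [ R(s') + V_{t+1}(s') ]
    Q = np.einsum("asn,n->sa", P, R + V)
    V = Q.max(axis=1)

print("V_0:", V)  # expected total reward of acting optimally from each state
```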


Kernel-based Reinforcement Learning

Consider the value function:
  Vt(s) = max_a ∑_s' P^a_ss' [ r(s,a,s') + Vt+1(s') ]

Approximate it using kernel-based averaging:
  Vt(si) = max_a ∑_<s,a,s'> w_a(si, s) [ r(s,a,s') + Vt+1(s') ]

where <s, a, s'> is a given data instance and

  w_a(si, sj) = φ(||si − sj|| / b) / ∑_<s,a,s'> φ(||si − s|| / b)

is the normalized weight kernel.

Provides a natural way of incorporating measures of patient similarity.
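
A minimal sketch of this kernel-averaged backup, assuming a dataset of <s, a, s', r> instances and a generic kernel profile φ (the Gaussian choice is given on the next slide); the function and variable names are mine, not the authors':

```python
import numpy as np

def kernel_backup(query_s, dataset, v_next, phi, bandwidth):
    """Kernel-averaged Bellman backup at state query_s.

    dataset: iterable of (s, a, s_next, r) instances with vector states.
    v_next:  estimate of V_{t+1}, applied to each s_next.
    phi:     kernel profile, e.g. a Gaussian bump.
    Returns the max over actions of the normalized weighted average of
    r + V_{t+1}(s_next) over the instances that took that action.
    """
    totals = {}  # action -> (weighted sum, weight sum)
    for s, a, s_next, r in dataset:
        w = phi(np.linalg.norm(np.asarray(query_s) - np.asarray(s)) / bandwidth)
        num, den = totals.get(a, (0.0, 0.0))
        totals[a] = (num + w * (r + v_next(s_next)), den + w)
    return max(num / den for num, den in totals.values() if den > 0)

# Toy usage: two Level-2 instances; QIDS is the only state feature here.
data = [([20.0], "BUP", [12.0], 0.0), ([21.0], "SER", [4.0], 1.0)]
v = kernel_backup([19.0], data, v_next=lambda s: 0.0,
                  phi=lambda u: np.exp(-u**2 / 2), bandwidth=5.0)
```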


Specifics of kernel regression

We assume a Gaussian kernel: φ(||si − sj||/b) = exp( −d²(si, sj) / 2σ² ).

We allocate one kernel function per training example <s, a, s’>.

The width of the kernel, σ², controls the neighbourhood of influence of each training sample.

  – σ² = 0 means no generalization to unseen states.

  – σ² = ∞ means every state has the (identical) mean value.

The distance function d²(si, sj) measures similarity between patients via their values on the independent variables.
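
Concretely, the Gaussian profile and a simple weighted Euclidean distance might look as follows; the per-feature weights are a hypothetical stand-in for a clinically informed similarity measure, not something specified in the talk.

```python
import numpy as np

def gaussian_phi(d2, sigma2):
    """Gaussian profile: exp(-d^2 / (2 sigma^2))."""
    return np.exp(-d2 / (2.0 * sigma2))

def patient_distance2(si, sj, feature_weights):
    """Squared distance d^2(si, sj) over the independent variables,
    with illustrative per-feature weights."""
    diff = np.asarray(si, dtype=float) - np.asarray(sj, dtype=float)
    return float(np.sum(np.asarray(feature_weights) * diff ** 2))

# Usage: patients described by (entrance QIDS-C, Level 1 QIDS-C).
w = gaussian_phi(patient_distance2([20, 14], [18, 15], [1.0, 1.0]), sigma2=4.0)
```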


Bias-variance trade-off

• Width of the kernel, σ², controls the neighbourhood of influence of each training sample.

– Small bandwidth = more variance.

– Large bandwidth = more bias.

• Bandwidth should be chosen to balance the bias-variance trade-off.

– Need to shrink bandwidth as sample size increases.

– With the right shrinking rate, we can guarantee:

  || Vt − Vt* || → 0 in probability as m → ∞
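
As a sketch of such a shrinking schedule: the rate below, σ ∝ m^(−1/5), is the classic univariate nonparametric-regression choice, assumed here since the talk does not state the actual rate used.

```python
def bandwidth_schedule(sigma0, m, rate=0.2):
    """Shrink kernel width as sample size m grows: sigma = sigma0 * m**(-rate).
    rate = 1/5 is a standard nonparametric-regression choice (an assumption)."""
    return sigma0 * m ** (-rate)

print([round(bandwidth_schedule(2.0, m), 3) for m in (10, 100, 1000)])
```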


Instance-based Reinforcement Learning

Estimate a non-parametric value function using kernel regression:
  Vt(s) = max_a ∑_s' w^a_ss' Vt+1(s')

[Figure: the patient database. Each patient contributes one trajectory, e.g.:
  Patient 1: pretreatment data → CIT, Level 1 data
  Patient 2: pretreatment data → CIT, Level 1 data → BUP, Level 2 data
  Patient 3: pretreatment data → CIT, Level 1 data → BUP, Level 2 data → BUP+Li, Level 3 data
  Patient 4: pretreatment data → CIT, Level 1 data → VEN, Level 2 data
  …
  Patient N: pretreatment data → CIT, Level 1 data → SER, Level 2 data → NTP, Level 3 data
For a new patient n, we consider all acceptable treatments (BUP, VEN, CIT+BUP, …); s' sweeps over the patient database; the weight w^a_ss' is proportional to the similarity of measurements; and Vt+1 is propagated from the matched instances.]


Caveat #1: Feature selection

• State representation at level i:
  si = pretreatment data + treatment 1 + measurements at level 1 + … + treatment i + measurements at level i

• Total state representation:
  – dozens of variables, multi-valued (up to 60+ values in some cases)

• Urgent need for feature selection! But…
  – Data is extremely sparse.
  – Feature selection in decision-making is an open research question.

• Currently, features are selected by hand based on discussions with clinicians.


Caveat #2: Reward elicitation

• Reward function should be provided by clinicians.

• How do we elicit this information from them?

– Reward needs to have sufficient discriminative information.

– Reward is an abstract concept for a clinician.

– Clinicians may not be consistent in reward assessment.

• Results presented today assume hand-crafted reward function.

• Ongoing work on a decision-theoretic method for eliciting preferences [Zawaideh & Pineau, submitted].


Current implementation

State variables:
  – Pre-treatment depression score QIDS-C0 = {1, …, 25}
  – Switch/Augment preference during levels 1 … t
  – Gradient over depression scores during levels 1 … t
  – Treatments at levels 1 … t

Decisions: Level 2 and 3 treatments

Reward function:
  r = 1 if QIDS-C ≤ 5
  r = 0 otherwise
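
In code, this remission-based reward is just an indicator, transcribing the definition above:

```python
def reward(qids_c: float) -> float:
    """r = 1 if the exit QIDS-C score indicates remission (QIDS-C <= 5), else 0."""
    return 1.0 if qids_c <= 5 else 0.0

assert reward(4) == 1.0 and reward(12) == 0.0
```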


Level 2 treatment - Switch group

[Figure: recommended Level 2 treatment for the Switch group, as a function of Entrance QIDS-C (x-axis) and Level 1 QIDS-C (y-axis).]


Level 2 treatment - Augment group


Level 2 myopic treatment - Augment group


Value function at level 2

• Recall the value function: Vt(s) = max_a ∑_s' w^a_ss' Vt+1(s')

[Figure: estimated value at Level 2, as a function of QIDS at exit from level 0 (x-axis) and QIDS at exit from level 1 (y-axis).]


Estimated Level 3 value for patients with pretreatment QIDS-C = 15 and BUP as Level 2 treatment

[Figure: value surface over Level 1 QIDS-C (QIDS at exit from level 1) and Level 2 QIDS-C (QIDS at exit from level 2).]

Estimated Level 3 value for patients with pretreatment QIDS-C = 15 and SER as Level 2 treatment

[Figure: value surface over Level 1 QIDS-C (QIDS at exit from level 1) and Level 2 QIDS-C (QIDS at exit from level 2).]


Summary

• Adaptive treatment strategies can be constructed using:

– data from SMART studies

– reinforcement learning analysis.

• The analysis highlights the role of treatment as a diagnostic tool.

• The analysis presented is generating new hypotheses regarding the treatment of depression.


Clinical challenges

• Establishing a close collaboration between clinical researchers and methodologists.

• Designing a confirmatory trial for learned adaptive strategies.

• Performing preference (reward) elicitation studies.


Methodological challenges

• Extending RL methods to provide confidence measures.

• Using RL with data from observational studies.

• Improving problem design:
  » variables that are predictive of costs/rewards,
  » variables that interact with treatment.


Related ongoing project

Learning adaptive treatment strategies for epilepsy (joint work with Massimo Avoli, MNI)

– Similar RL framework, different function approximation (not kernel RL).

– Gather data from a generative model of epilepsy (rather than a SMART trial).

– Faster treatment cycle; the strategy can adapt to the patient over time.


Follow-up

• Recent paper (proofs available by email: [email protected]):
  Pineau, Ghizaru, Bellemare, Rush & Murphy. “Improving the management of chronic disorders by learning adaptive treatment strategies.” Drug and Alcohol Dependence, Supplemental Issue on Adaptive Treatment Strategies. Elsevier. In press.

• Grad student position available for this project.

• Acknowledgements:
  – Data generously provided by the STAR*D team.
  – Partial funding provided by NIH (grant R21 DA019800).
  – Some slide material borrowed from: http://www.stat.lsa.umich.edu/~samurphy/nida/seminars.html