
The exploration-exploitation trade-off

Pantelis Pipergias Analytis

Cornell University

February 5, 2018



Examples of exploration and exploitation in real life

1 Going to your favorite restaurant/bar vs. trying a new one.

2 Listening to music from a band you love vs. discovering new ones.

3 Preparing a meal that you have made successfully in the past and you enjoyed vs. cooking up something new.

4 Reading a newspaper article from a journalist you like vs. reading something from a newcomer in the field.

5 A chimpanzee foraging in a new territory with unknown food resources as opposed to the known home territory.

6 An organization trying a new organizational structure vs. a decently working existing one.


Multi-armed bandit (MAB) problem

Option 1: payoffs drawn from N(µ1, σ1), here N(12, 3); no payoffs observed yet (??).

Option 2: payoffs drawn from N(µ2, σ2), here N(15, 3); payoffs observed so far: 7.4, 17.3.


History of the problem

1 "The [MAB] problem was formulated during the war, and efforts to solve it so sapped the energies and minds of Allied scientists that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage." — Whittle (1980)

2 The first papers and strategies on the topic were written by Thompson (1933) and Robbins (1952).

3 Bellman and Gittins provided backward-looking and forward-looking solutions to the problem.

4 Today the MAB framework is behind numerous algorithms that are used in the online world.

5 Note the similarities to the search problem considered last week: the problems fold into each other.


Domains where MABs have been applied

1 Developing new medicines — clinical trials.

2 One of the steam engines for studying human (and animal) learning.

3 A very general framework for autonomous AI decision making. It is used as an alternative to A/B testing.

4 Currently used to allocate ads on the web. Companies like Criteo rely heavily on this framework.

5 Used to decide which learning algorithm to use in a specific context.

6 Used to model how companies might choose among organizational structures or technologies of unknown merit.


Different strategies for coping with the multi-armed bandit problem

Go optimal — not always possible and often computationally very expensive.

Go greedy — always try the best alternative.

Add some noise — randomize once in a while (ε-greedy; a minimal sketch follows this list).

When randomizing, choose options with higher expected return with a higher probability (softmax).

Probability matching — choose actions according to their probability of being the best.

Optimism in the face of uncertainty — prefer actions that are more uncertain, as they may turn out to be really good.
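To make the ε-greedy rule concrete, here is a minimal sketch (an illustration, not code from the lecture) on the two-option Gaussian bandit introduced earlier; the arm means N(12, 3) and N(15, 3) come from that slide, while ε = 0.1 and the horizon are illustrative assumptions.

```python
import random

def epsilon_greedy(n_trials=1000, epsilon=0.1, means=(12, 15), sd=3):
    """Minimal epsilon-greedy agent on a Gaussian bandit."""
    k = len(means)
    counts = [0] * k          # pulls per arm
    values = [0.0] * k        # sample-mean reward estimates
    total = 0.0
    for _ in range(n_trials):
        if random.random() < epsilon:                  # explore: random arm
            arm = random.randrange(k)
        else:                                          # exploit: greedy arm
            arm = max(range(k), key=lambda i: values[i])
        reward = random.gauss(means[arm], sd)
        counts[arm] += 1
        # incremental update of the sample mean for the pulled arm
        values[arm] += (reward - values[arm]) / counts[arm]
        total += reward
    return values, total

estimates, total = epsilon_greedy()
```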


The Gittins index (Christian and Griffiths, ch. 2)

Possible to calculate for Bernoulli bandits with stable discounting of future trials.


A simple example (Sutton and Barto, ch. 2)


Performance of the ε-greedy algorithm


Starting optimistically
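The accompanying figure did not survive the transcript, but the idea (from Sutton and Barto, ch. 2) is easy to sketch: start every value estimate well above any plausible reward, so even a purely greedy agent is driven to try each arm before settling. This is a hedged illustration; q0 = 25 is an arbitrary optimistic constant for the N(12, 3)/N(15, 3) bandit above.

```python
import random

def greedy_optimistic(n_trials=1000, q0=25.0, means=(12, 15), sd=3):
    """Pure greedy agent whose value estimates start optimistically high."""
    k = len(means)
    counts = [0] * k
    values = [q0] * k         # optimism: above any plausible reward
    for _ in range(n_trials):
        arm = max(range(k), key=lambda i: values[i])   # always greedy
        reward = random.gauss(means[arm], sd)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
    return values
```

Early pulls "disappoint" the agent, pushing the inflated estimate down and sending it to the other arms, which produces systematic early exploration without any randomness.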


Discussion: A/B testing and exploration-exploitation


The softmax rule

Biases exploration towards the more promising actions: choice probabilities are graded according to the options' estimated values.

$$P(C(t) = j) = \frac{\exp(\theta E_j(t))}{\sum_{k=1}^{K} \exp(\theta E_k(t))}$$

where θ is a temperature parameter controlling how biased the algorithm will be towards the highest-valued options.
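A minimal sketch of the rule above, using the slide's notation (E for the current value estimates, θ for the temperature parameter); the max-subtraction is a standard numerical-stability trick, not part of the rule itself.

```python
import numpy as np

def softmax_choice(E, theta=1.0):
    """Sample arm j with P(j) = exp(theta * E[j]) / sum_k exp(theta * E[k])."""
    z = theta * np.asarray(E, dtype=float)
    z -= z.max()                              # stability: exp never overflows
    p = np.exp(z) / np.exp(z).sum()
    return np.random.default_rng().choice(len(p), p=p)

# Larger theta concentrates choices on the best-looking arm;
# theta = 0 reduces to uniform random exploration.
arm = softmax_choice([12.3, 14.8, 13.1], theta=0.5)
```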


Optimism in the face of uncertainty and the upper confidence bound (UCB)

The more uncertain you are about the value of an option, the more important it is to explore it. That option could turn out to be really good and, in the long term, improve your overall utility.

UCB: $P(C = i) \propto \exp(\theta m_i + \alpha \sqrt{\mathrm{var}_i})$
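A sketch of the slide's choice rule, assuming we already track a mean estimate m_i and an uncertainty var_i per arm (how those are maintained, e.g. by Bayesian updating, is left abstract here):

```python
import numpy as np

def soft_ucb_choice(m, var, theta=1.0, alpha=1.0):
    """P(C = i) proportional to exp(theta * m_i + alpha * sqrt(var_i)):
    a softmax over value estimates plus an uncertainty bonus, so arms
    we know little about receive extra exploration."""
    score = theta * np.asarray(m, dtype=float) + alpha * np.sqrt(np.asarray(var))
    score -= score.max()                      # numerical stability
    p = np.exp(score) / np.exp(score).sum()
    return np.random.default_rng().choice(len(p), p=p)
```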


UCB against ε-greedy


Probability matching, changing environments and Thompson sampling

Probability matching suggests sampling alternatives according to their rewards or their probability of being the best.

Thompson sampling is an implementation of the probability matching principle.
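Thompson sampling is easiest to see on a Bernoulli bandit. The Beta-Bernoulli sketch below is the standard textbook instance (an illustration, not necessarily the variant used later in the lecture): sample once from each arm's posterior and play the arm whose sample is largest, which chooses each arm with exactly its current posterior probability of being the best.

```python
import random

def thompson_bernoulli(success_probs, n_trials=1000):
    """Beta-Bernoulli Thompson sampling: keep a Beta(a, b) posterior
    per arm, draw one sample from each posterior, play the argmax."""
    k = len(success_probs)
    a = [1] * k   # prior successes + 1
    b = [1] * k   # prior failures + 1
    for _ in range(n_trials):
        samples = [random.betavariate(a[i], b[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        if random.random() < success_probs[arm]:   # simulate the pull
            a[arm] += 1
        else:
            b[arm] += 1
    return a, b
```

For changing environments, one would additionally discount or window old observations so the posteriors can track drifting reward rates.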


Collective exploration

Rogers' paradox — produce or scrounge?

The social learning tournament — Rendell et al. (2010)

Counter-intuitive more-or-less effects — Toyokawa et al. (2014)


The typical bandit setting is like blind tasting...



My grandma's problem: Choosing the best place to swim


The machine learner's problem

Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010, April). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web.


A contextual bandit experiment



A contextual bandit experiment: Results



Contextual multi-armed bandit (CMAB) problem

Option 1: instead of N(µ1, σ1), payoffs are drawn from N(f(·), σ1), e.g. N(w1x1 + w2x2, σ).

Option 2: instead of N(µ2, σ2), payoffs are drawn from N(f(·), σ2), e.g. N(w1x1 + w2x2, σ).
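To make the contextual setting concrete, here is a minimal LinUCB-style sketch in the spirit of Li et al. (2010), cited above. This simplified single-model version (one shared ridge regression across arms, with illustrative true weights w = (2, 1) and constants) is a sketch under those assumptions, not their exact algorithm.

```python
import numpy as np

def linucb(n_trials=500, d=2, alpha=1.0, noise=0.25):
    """Shared-model LinUCB sketch: ridge regression over (features, payoff)
    pairs, with a confidence bonus alpha * sqrt(x' A^-1 x) per arm."""
    rng = np.random.default_rng(0)
    w = np.array([2.0, 1.0])                       # true (unknown) weights
    A = np.eye(d)                                  # ridge Gram matrix
    bvec = np.zeros(d)                             # reward-weighted feature sums
    for _ in range(n_trials):
        X = rng.uniform(0.1, 0.9, size=(20, d))    # 20 arms, 2 visible features
        A_inv = np.linalg.inv(A)
        theta_hat = A_inv @ bvec                   # current weight estimate
        bonus = np.sqrt(np.einsum("ij,jk,ik->i", X, A_inv, X))
        arm = int(np.argmax(X @ theta_hat + alpha * bonus))
        reward = X[arm] @ w + rng.normal(0, noise) # true payoff function
        A += np.outer(X[arm], X[arm])
        bvec += reward * X[arm]
    return theta_hat

print(linucb())   # the estimate should approach the true weights (2, 1)
```

Because the payoff function is shared across arms, observing any arm's payoff informs predictions about all of them; that is what separates the CMAB from the plain MAB.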


Realistic decision problem...



Motivation

Why is the CMAB problem interesting?

1 It better captures the important characteristics of decisions in the wild.

2 We can study how function learning interacts with decision making, how people deal with novelty, and transfer of learning.

3 TD(λ) & curse of dimensionality - function learning as a solution. These problems are notoriously hard to solve using optimization techniques.

4 There is no realistic framework within which we can study how people form their preferences. CMAB might provide us with one.


CMAB task



MAB task



One-shot choices in the test phase

Three alternatives:

Dominating - highest function value.

Neutral - middle function value.

Dominated - lowest function value.



Experimental Design

Training phase

Between-subject design – CMAB or MAB

Contextual multi-armed bandit (CMAB) task – two informative features are visually displayed

Classic multi-armed bandit (MAB) task – control group, features are not visible

20 alternatives, 100 trials

Test phase

Designed to test the functional knowledge.

One-shot choices, no outcome feedback.

3 arms in 70 trials.


Gaussian process (GP) based "optimal" solutions

Goal: simultaneously learn and optimize an unknown function.

$$y = f(x) + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)$$

GP-based function learning process:

$$f(x) \sim GP(m(x), K(x, x')), \quad K(x, x') = \sigma_f^2 \exp\left(-\frac{(x - x')^2}{2l^2}\right)$$

Two versions of the choice process:

1 Upper confidence bound (GP-UCB): $\arg\max_i \, m_i + 2\sqrt{\mathrm{var}_i}$

2 Thompson sampling (GP-Th): draw from $p(\theta \mid D, M)$ for each arm, take the max.
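A compact sketch of the GP-UCB loop just described, for a 1-D input on [0, 1]. The kernel matches the slide's squared-exponential form and the choice rule is argmax m_i + 2·sqrt(var_i); the test function, candidate grid, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def rbf(x, xp, sf=1.0, l=0.2):
    """Squared-exponential kernel: sf^2 * exp(-(x - x')^2 / (2 l^2))."""
    return sf**2 * np.exp(-(x[:, None] - xp[None, :])**2 / (2 * l**2))

def gp_posterior(X, y, Xs, sigma=0.1):
    """Posterior mean and variance of the GP at test points Xs."""
    K_inv = np.linalg.inv(rbf(X, X) + sigma**2 * np.eye(len(X)))
    Ks = rbf(Xs, X)
    mean = Ks @ K_inv @ y
    var = np.diag(rbf(Xs, Xs)) - np.einsum("ij,jk,ik->i", Ks, K_inv, Ks)
    return mean, np.maximum(var, 1e-12)

def gp_ucb(f, n_trials=20, sigma=0.1):
    rng = np.random.default_rng(0)
    grid = np.linspace(0, 1, 50)                     # candidate arms
    X = np.array([rng.uniform(0, 1)])                # one random first pull
    y = np.array([f(X[0]) + rng.normal(0, sigma)])
    for _ in range(n_trials - 1):
        mean, var = gp_posterior(X, y, grid, sigma)
        x_next = grid[np.argmax(mean + 2 * np.sqrt(var))]   # optimism
        X = np.append(X, x_next)
        y = np.append(y, f(x_next) + rng.normal(0, sigma))
    return X, y

X, y = gp_ucb(lambda x: np.sin(3 * x))               # toy "unknown" function
```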


GP prior, 1D example

[Figure: four panels plotting output f(x) against input x ∈ [0, 1]: the GP prior with a squared-exponential kernel (l = 1), and the GP-UCB posterior after trials 3, 5, and 20.]


GP-Thompson, 1D example

[Figure: four panels plotting output f(x) against input x ∈ [0, 1]: the GP-Thompson posterior after trials 2, 3, 20, and 100.]


How much do people rely on knowledge of the relationships between features and alternatives' values when making decisions?

Can we model people's behavior using traditional machine learning models?

How do priors about functional relationships affect decision making?

Do people explore the choice set strategically, to learn the relationships?


Experiment 1 – Positive linear function

Experiment 1a – Amazon Turk

Feature values x drawn from U(0.1, 0.9).

For each arm j in trial t, the payoff Rj(t) was computed as

$$R_j(t) = 2 \times x_{1,j} + 1 \times x_{2,j} + \varepsilon_j(t),$$

with εj(t) drawn independently for each arm in every trial from N(0, 0.25).

The task was to maximize the cumulative reward.

186 participants – monetary payoffs.

Experiment 1b – lab replication

Weights and noise rescaled: w1 = 20, w2 = 10, noise N(0, 2.5).

75 UPF lab participants – monetary payoffs.
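For concreteness, a sketch of the payoff-generating process described above. The 20-arms-per-trial structure comes from the design slide; whether N(0, 0.25) denotes the variance or the standard deviation is ambiguous in the slide, so the standard-deviation reading here is an assumption.

```python
import numpy as np

rng = np.random.default_rng()

def exp1a_trial(n_arms=20):
    """One Experiment 1a trial: features from U(0.1, 0.9), payoffs
    R_j = 2*x1_j + 1*x2_j + eps_j with eps_j ~ N(0, 0.25) (sd reading)."""
    x = rng.uniform(0.1, 0.9, size=(n_arms, 2))    # two features per arm
    rewards = 2 * x[:, 0] + 1 * x[:, 1] + rng.normal(0, 0.25, size=n_arms)
    return x, rewards
```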


Mean choice rank – Exp 1a

[Figure: mean rank of the chosen alternative (1-13, random performance marked) over five blocks, for the MAB and CMAB conditions and the GP-UCB benchmark; trajectories for two individual participants shown.]


Mean choice rank – Exp 1b

[Figure: same layout as for Exp 1a (mean rank over five blocks, MAB and CMAB conditions, GP-UCB benchmark), for the lab replication.]



One-shot choices in the test phase

[Figure: mean proportion of choices by rank of the chosen alternative (1-3) in the CMABn group, across five test-item types: Diff/Extra, Diff/Inter, Easy/Extra, Easy/Inter, and the weight test.]


One-shot choices – Lab replication

[Figure: same layout as the previous panel (mean proportion of choices by rank, five test-item types), for the lab replication.]


Exploration in the feature space

[Figure: heatmaps of the proportion of choices over feature 1 × feature 2 bins, from (0.1,0.3] to (0.7,0.9], for the MAB and CMABn conditions.]


Exploration in the feature space – First 10 trials

[Figure: heatmaps of the proportion of choices over feature 1 × feature 2 bins during the first 10 trials, for the MAB and CMABn conditions. Slide links: "Lab replication", "Clusters: Exploration".]


Exploration in the feature space – All trials

[Figure: heatmaps of the proportion of choices over feature 1 × feature 2 bins across all trials, for the MAB and CMABn conditions. Slide link: "Exp 1b - Lab replication".]


Inter-individual differences: Function-based and naive learners


Clustering according to the test phase performance

[Figure: mean proportion of choices by rank of the chosen alternative (1-3) for clusters 1 and 2, across the five test-item types (Diff/Extra, Diff/Inter, Easy/Extra, Easy/Inter, weight test).]


Clusters: Performance in the CMAB task

[Figure: mean rank of the chosen alternative over five blocks for the two CMABn clusters. Cluster 1: N = 43, mean training rank Rtr = 7, mean test rank Rte = 1.94; cluster 2: N = 53, Rtr = 4.59, Rte = 1.24.]


Clusters: Feature space, first 10 trials

[Figure: heatmaps of the proportion of choices over feature 1 × feature 2 bins during the first 10 trials, for CMABn clusters 1 and 2. Slide link: "Exploration in the MAB condition".]


Modeling user behavior

Learning: we model participants either as function learners (GP) or as tracking mean rewards (BMT).

1 Gaussian process (GP) function learning model: $f(x) \sim GP(m(x), K(x, x'))$, with $K(x, x') = \sigma_f^2 \exp\left(-\frac{(x - x')^2}{2l^2}\right)$

2 Bayesian mean reward tracking (BMT)

Choices: participants either use uncertainty in balancing exploration and exploitation (UCB) or not (SM).

1 Upper confidence bound (UCB): $P(C = i) \propto \exp(\theta m_i + \alpha \sqrt{\mathrm{var}_i})$

2 Softmax (SM): $P(C = i) \propto \exp(\theta m_i)$


Modeling user behavior

Model     BICw        N   #   C1        C2
BMT-SM    .54 (.38)   19  4   .48 (12)  .72 (7)
BMT-UCB   .05 (.55)    0  5   .04 (0)   .07 (0)
GP-SM     .27 (.38)   12  4   .34 (11)  .09 (1)
GP-UCB    .02 (.04)    0  5   .03 (0)   .01 (0)
RCM       .11 (.32)    4  0   .11 (3)   .11 (1)

There is evidence for the GP models, especially for participants that know the function well (according to the test task). Models with UCB perform poorly.


Experiment 1c – Quadratic and mixed linear functions

Training phase

2x2 between-subject design: Type of task (CMAB, MAB) and Type of function (Quadratic, Mixed)

Quadratic function: $1 + 60(x_1 - .02)^2 + 60(x_2 - .02)^2 + 30 x_1 x_2$, noise N(0, 2.5)

Mixed linear function: w1 = 40, w2 = −30, noise N(0, 2.5)

376 participants – Amazon Turk – monetary payoffs.

Test phase

Test items for the mixed linear function are the same as for the positive linear one.

Special items for the quadratic function, testing whether people detected the nonlinear nature of the relationship.


Exploration in the feature space, first 10 trials

[Figure: heatmaps of the proportion of choices over feature 1 × feature 2 bins during the first 10 trials, in four panels: MAB mixed, CMAB mixed, MAB quadratic, CMAB quadratic. Slide links: "Mean choice rank", "One-shot choices".]


Exploration in the feature space, all trials

[Figure: the same four heatmap panels (MAB mixed, CMAB mixed, MAB quadratic, CMAB quadratic) over all trials. Slide links: "Mean choice rank", "One-shot choices".]


Experiment 2 – Function learning pretraining

Exploration to learn the function should depend on...

Uncertainty about the function.

Type of function.

Horizon.

Expecting a need for generalization.

Training phase

Mixed design – two between factors, Type of function (Positive linear, Quadratic) x Horizon (100 or 30 trials in the CMAB phase), and one within factor (With or without a function learning phase).

Function learning task – 100 trials with a single alternative, same two features and function, accuracy incentivized.

Same positive linear and quadratic functions as before, but alternatives now include randomly drawn intercepts!

425 participants – Amazon Turk – monetary payoffs.


Mean choice ranks

[Figure: mean rank of the chosen alternative (1-13, random performance marked) over ten blocks, in two panels (Linear, Quadratic), comparing the CMAB, fCMAB, and fCMABs conditions for each function.]


Exploration in the feature space, first 10 trials

[Figure: heatmaps of the proportion of choices over feature 1 × feature 2 bins during the first 10 trials, in six panels: CMAB lin, fCMAB lin, fCMABs lin, CMAB quad, fCMAB quad, fCMABs quad. Slide link: "One-shot choices".]


Exploration in the feature space, all trials

[Figure: the same six heatmap panels (CMAB, fCMAB, fCMABs; linear and quadratic) over all trials. Slide link: "One-shot choices".]


Summary

People learn the function and generalize their knowledge to new decision situations.

But there are inter-individual differences – some people rely on learning the function, others are naive learners; akin to model-based vs. model-free RL (a contrast sketched in code below).

New flavour of the exploration-exploitation trade-off – evidence that people simultaneously learn and optimize the function.

Priors about the functional relationship can hurt performance.

People do not seem to take the time horizon into account. People exploit more aggressively when they have been pre-trained on the function.
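To make the model-based vs. model-free contrast concrete, here is a hedged Python sketch, continuing the assumed environment above: a naive learner that only tracks per-alternative reward averages, next to a function learner that regresses rewards on the features and generalizes to untried alternatives. Both learners and their parameters (the exploration rate eps, the minimum number of observations before fitting) are illustrative assumptions, not fitted models of participants.

```python
import numpy as np

rng = np.random.default_rng(1)

class NaiveLearner:
    """Model-free: tracks a running mean reward per alternative and
    ignores the features entirely."""
    def __init__(self, n_alt):
        self.means = np.zeros(n_alt)
        self.counts = np.zeros(n_alt)

    def choose(self, X, eps=0.1):
        if rng.random() < eps:                 # occasional random exploration
            return int(rng.integers(len(self.means)))
        return int(np.argmax(self.means))

    def update(self, choice, reward, X):
        self.counts[choice] += 1
        self.means[choice] += (reward - self.means[choice]) / self.counts[choice]

class LinearFunctionLearner:
    """Model-based: fits a linear function of the features to the observed
    rewards and generalizes it to all alternatives, including untried ones.
    Note the built-in prior: it assumes the function is linear."""
    def __init__(self):
        self.X_seen, self.y_seen = [], []

    def choose(self, X, min_obs=3):
        if len(self.y_seen) < min_obs:         # too little data: explore
            return int(rng.integers(len(X)))
        A = np.column_stack([np.ones(len(self.X_seen)), np.array(self.X_seen)])
        w, *_ = np.linalg.lstsq(A, np.array(self.y_seen), rcond=None)
        preds = np.column_stack([np.ones(len(X)), X]) @ w
        return int(np.argmax(preds))

    def update(self, choice, reward, X):
        self.X_seen.append(X[choice])
        self.y_seen.append(reward)
```

Run on the quadratic environment, the linear-prior learner tends to favour extreme feature values while the payoff actually peaks in the interior – one way a wrong prior about the functional relationship can hurt performance.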


Challenges and future directions

The goal is to develop a function-learning-based RL model at the algorithmic level (one candidate direction is sketched below).

Moreover, it is difficult to fit function learning models without prediction data. However, asking for predictions along with choices changes the behaviour.

How do people behave in the presence of information about the alternatives and other contextual cues?
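As one hedged illustration of that algorithmic level – not the model proposed here – function learning and exploration can be combined in a Gaussian-process upper-confidence-bound choice rule; the kernel, its length scale, and the exploration weight beta are assumed values.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def gp_ucb_choice(X_seen, y_seen, X_alternatives, beta=2.0):
    """Pick the alternative that maximizes an upper confidence bound on a
    GP estimate of the payoff function: a high posterior mean favours
    exploitation, high posterior uncertainty favours exploration."""
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), alpha=1.0)
    gp.fit(np.asarray(X_seen), np.asarray(y_seen))
    mu, sd = gp.predict(np.asarray(X_alternatives), return_std=True)
    return int(np.argmax(mu + beta * sd))
```

Because such a model produces value estimates on every trial, it could in principle be fit to choices alone, via a softmax over the UCB values, without eliciting explicit predictions.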


Acknowledgments

Funding:

FPU grant, Ministry of Education, Culture and Sports, Spain

Max Planck Institute for Human Development, Berlin

Barcelona Graduate School of Economics


Quadratic function – An illustration

[Figure: illustration of the quadratic payoff function over the two-feature space.]
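A quick numerical illustration of the property the figure conveys, using the same assumed quadratic coefficients as in the environment sketch earlier: the payoff peaks at the centre of the feature space and falls off towards its edges.

```python
# Assumed quadratic payoff (illustrative coefficients, as above).
quadratic = lambda x1, x2: 100 - 150 * ((x1 - 0.5) ** 2 + (x2 - 0.5) ** 2)
print(quadratic(0.5, 0.5), quadratic(0.1, 0.1), quadratic(0.9, 0.9))  # approx. 100, 52, 52
```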


Individual behaviour in the training phase – Experiment 1a

[Figure: Rank of the chosen alternative over 100 training trials; choice behavior of subject e2−0124, CMABn condition, LowNoise experiment.]


Individual behaviour in the training phase – Experiment 1a

[Figure: Rank of the chosen alternative over 100 training trials; choice behavior of subject e2−0065, CMABn condition, LowNoise experiment.]


Mean choice rank – Lab replication

[Figure: Mean rank of the chosen alternative by block (1–5) for the MAB and CMAB conditions; a horizontal line marks random performance.]


One-shot choices – Lab replication

[Figure: Mean proportion of one-shot choices by rank of the chosen alternative (1–3) in the CMABn condition, across the Diff/Extra, Diff/Inter, Easy/Extra, Easy/Inter, and Weight test trials.]


Feature space, all trials – Lab replication

[Figure: Proportion of choices across the binned Feature 1 × Feature 2 space over all trials, MAB and CMABn panels.]


Feature space, first 10 trials – Lab replication

[Figure: Proportion of choices across the binned Feature 1 × Feature 2 space during the first 10 trials, MAB and CMABn panels.]


Mean choice rank – Mixed and quadratic

[Figure: Mean rank of the chosen alternative by block (1–5), Mixed and Quadratic panels, for the MAB and CMAB conditions; horizontal lines mark random performance.]


One-shot choices – Mixed and quadratic

[Figure: Mean proportion of one-shot choices by rank of the chosen alternative (1–3): CMAB mixed panels (Easy, Difficult, Weight test) and CMAB quadratic panels (Max, Min, and Slope tests 1 and 2).]


Mean choice rank – Positive and quadratic

[Figure: Mean rank of the chosen alternative by block (1–10), Linear and Quadratic panels, for the CMAB, fCMAB, and fCMABs conditions; horizontal lines mark random performance.]


One-shot choices – Positive linear

[Figure: Mean proportion of one-shot choices by rank of the chosen alternative (1–3) for the CMAB lin and fCMAB lin conditions, across Easy, Difficult, and Weight test trials.]


One-shot choices – Quadratic

[Figure: Mean proportion of one-shot choices by rank of the chosen alternative (1–3) for the CMAB q and fCMAB q conditions, across Max, Min, and Slope tests 1 and 2.]
