TRANSCRIPT
Cognitive Modeling, University of Groningen / Artificial Intelligence
Rensselaer Cognitive Science, CogWorks Laboratories
› Christian P. Janssen
› Wayne D. Gray
› Michael J. Schoelles
How a Modeler’s Conception of Rewards Influences a Model’s Behavior: Investigating ACT-R 6’s Utility Learning Mechanism
Temporal difference learning & ACT-R
› Temporal difference learning has recently been introduced as ACT-R’s new utility learning mechanism (e.g., Fu & Anderson, 2004; Anderson, 2006, 2007; Bothell, 2005)
› Utility learning optimizes a model’s behavior so as to maximize the rewards that the model receives
› A model can:
• Receive rewards at different moments in time
• Receive rewards of different magnitudes
› There are no guidelines for choosing when a reward should be given and what its magnitude should be
New issues for ACT-R
› We studied two aspects of TD learning:
• When the reward is given
• The magnitude of the reward
› This is a new issue for ACT-R:
• When the reward is given: could already be varied in ACT-R 5
• Magnitude of the reward: could not be varied in ACT-R 5
› As we will show, the modeler’s conception of rewards has a big influence on a model’s behavior
› Case study: Blocks World task (Gray et al., 2006)
Why the Blocks World task?
› Previous work indicates that the utility learning mechanism is crucial for this task
• ACT-R 5 models (Gray, Sims, & Schoelles, 2005)
• Regular ACT-R 5 cannot provide a good fit to the human data
• Because rewards in ACT-R 5 are binary (i.e., successes and failures), not scalar
• Ideal Performer Model (Gray et al., 2006)
• A model outside of ACT-R that uses temporal difference learning provided a very good fit
Blocks World task
› So what’s the task?
Blocks World task
Task: “Copy pattern in target window by moving blocks from resource window to workspace window”
Blocks World task
Windows are covered with gray rectangles: accessing information requires interaction with the interface
Blocks World task
› Blocks World task:
• Information in the Target Window is only available after waiting for a lockout time
• 0, 400, or 3200 milliseconds (between subjects)
Blocks World task: human data (Gray et al., 2006)
› Size of lockout time influences human behavior:
[Figure: number of blocks placed after the 1st visit to the target window (0–5) as a function of lockout time (0.0–3.0 s)]
Blocks World task: Modeling Strategies
› Strategy: How many blocks do you plan to place after a visit to the target window?
› 8 encode-x production rules
• “Study x blocks”
• encode-1 through encode-8
› Model learns utility value of each production rule using ACT-R’s temporal difference learning algorithm
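Conflict resolution among these eight rules can be sketched as follows. This is an illustration, not the official ACT-R 6 implementation: the noise parameter s, the initial utilities, and the exact noise distribution are assumptions for this example; ACT-R selects the production with the highest noisy utility.

```python
# Illustrative sketch (not the official ACT-R 6 code) of utility-based
# conflict resolution among the eight encode-x rules: the rule with the
# highest utility plus logistic noise is selected. The noise parameter
# s and the initial utilities are assumptions for this example.
import math
import random

def logistic_noise(s=0.5):
    """Sample logistic noise, as ACT-R's utility noise is commonly described."""
    u = min(max(random.random(), 1e-9), 1 - 1e-9)  # avoid log(0)
    return s * math.log((1 - u) / u)

utilities = {f"encode-{x}": 0.0 for x in range(1, 9)}  # encode-1 .. encode-8

def choose_rule(utilities, s=0.5):
    # Pick the rule whose (utility + noise) is largest.
    return max(utilities, key=lambda rule: utilities[rule] + logistic_noise(s))

rule = choose_rule(utilities)
print(rule)  # one of encode-1 .. encode-8, at random while utilities are equal
```

With all utilities equal, the noise makes selection effectively random; as learning pushes the utilities apart, the better-rewarded strategies win more often.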
Utility learning
› Utility learning requires the incorporation of rewards
› Two choices are crucial:
• When is the reward given?
• What is the magnitude of the reward?
› After some experience, the utility of a production rule approximates (Anderson, 2007):
U_i = r(t_x) − (t_x − t_i)
where r(t_x) is the magnitude of the reward, t_x is the time at which the reward is given, and t_i is the time at which production i fired
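As a sketch of how utilities approach this value, the underlying update (Anderson, 2007) moves the utility a fraction alpha toward the experienced reward minus the delay. The reward, delay, and learning-rate values below are illustrative, not taken from the slides:

```python
# Sketch of the utility update behind the asymptotic value above:
#   U_i(n) = U_i(n-1) + alpha * (R_i(n) - U_i(n-1))
# where R_i(n) = r(t_x) - (t_x - t_i). The reward, delay, and alpha
# values below are illustrative, not taken from the slides.

def update_utility(u, reward, delay, alpha=0.2):
    """One update for a production that fired `delay` seconds before
    a reward of magnitude `reward` arrived."""
    effective = reward - delay          # R_i(n) = r(t_x) - (t_x - t_i)
    return u + alpha * (effective - u)

u = 0.0
for _ in range(50):                     # repeated identical experience
    u = update_utility(u, reward=10.0, delay=3.0)
print(round(u, 2))  # → 7.0, i.e. r(t_x) - (t_x - t_i)
```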
Utility learning
› Choice 1: When is the reward given?
› Important because:
• The utility value has a linear relationship with the time at which the reward is given
› Choice in Blocks World:
• Once model: update once, at the end of the trial
• Each model: update each time that part of the task is completed, i.e., a (set of) block(s) has been placed and the model either returns to the target window to study more blocks or finishes the trial
U_i = r(t_x) − Δt(t_i, t_x)
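To illustrate why this choice matters: because U_i depends linearly on the delay, the same production gets a very different utility when the reward arrives at the end of the trial (“once”) than when it arrives as soon as the sub-task completes (“each”). All timings below are hypothetical, not data from the task:

```python
# Hypothetical illustration of the "once" vs. "each" choice: the same
# production, fired at the same time, ends up with different utilities
# because the delay t_x - t_i differs. All timings are made up.

def asymptotic_utility(reward, t_fired, t_reward):
    return reward - (t_reward - t_fired)    # U_i = r(t_x) - (t_x - t_i)

t_fired = 1.0   # an encode-x rule fires early in the trial
u_once = asymptotic_utility(reward=10.0, t_fired=t_fired, t_reward=20.0)  # end of trial
u_each = asymptotic_utility(reward=10.0, t_fired=t_fired, t_reward=5.0)   # sub-task done
print(u_once, u_each)  # → -9.0 6.0
```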
Utility learning
› Choice 2: What is the magnitude of the reward?
› Important because:
• The utility value has a linear relationship with the magnitude of the reward
› But how to set this value?
• Experimental tweaking? → unfavorable
• Fixed range of values (e.g., between 0 and 1)? → difficult
• Relate to neurological data? → not available for most models
U_i = r(t_x) − Δt(t_i, t_x)
Utility learning
› Choice 2: What is the magnitude of the reward?
› Choice in Blocks World:
• Relate the reward to what might be important in the task
• Accuracy: the accuracy with which the task is performed; options:
• Success: # blocks placed (once)
• Success: # blocks placed (each)
• Success & failure: # blocks placed − # blocks forgotten (each)
• Time: how much time does (part of) the task take; options:
• Time spent on the task: −1 × time spent (once)
• Time spent waiting for a specific aspect of the task: −1 × lockout size × number of visits to the target window (once)
• Number of blocks placed per second (each)
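The options above can be made concrete with a small sketch. The trial statistics are hypothetical; the formulas simply transcribe the bullet points:

```python
# Hypothetical trial statistics used to compute each reward option
# listed above; the numbers are illustrative, the formulas transcribe
# the bullets.
blocks_placed = 8
blocks_forgotten = 1
trial_time = 40.0        # seconds spent on the whole trial
lockout = 3.2            # lockout size in seconds
visits = 3               # visits to the target window

r_success           = blocks_placed                     # success (once/each)
r_success_failure   = blocks_placed - blocks_forgotten  # success & failure (each)
r_time_on_task      = -1 * trial_time                   # time spent (once)
r_time_waiting      = -1 * lockout * visits             # time waiting (once)
r_blocks_per_second = blocks_placed / trial_time        # rate (each)

print(r_success, r_success_failure, r_time_on_task,
      r_blocks_per_second)  # → 8 7 -40.0 0.2
```

Note how the accuracy-based rewards and the time-based rewards differ not only in sign but in scale, which is one reason the magnitude choice changes the model’s behavior.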
Blocks World task: Modeling Strategies
› 6 models were developed
› Each model is run 6 times for each of 3 experimental conditions:
• 0, 400, and 3200 milliseconds of lockout time
› Models interact with the same interface as human participants
Blocks World task: general results
› Each model has unique results
Blocks World task: general results
› What is the impact of:
• When the reward is given (once/each)
• The concept of the reward (related to accuracy/time)
› Results averaged over 3 models
Utility learning: impact of when reward is given
Utility learning: impact of concept of reward
Comparison with ACT-R 5 (Gray, Sims & Schoelles, 2005)
Conclusion
› Rewards can be given at different times during a trial and according to different concepts
› There are no guidelines for what the best choices are
› Blocks World suggests that rewards should:
• Be given once: the model can optimize behavior over the entire task
• Relate to the concept of time: because different strategy choices have a big impact on reward size
› Models of other tasks should show whether this finding generalizes
Conclusion
› This is not just a Blocks World issue
• General Computer Science / AI issue: representing a task in the right way is crucial (e.g., Russell & Norvig, 1995; Sutton & Barto, 1998)
• Many experiments involve manipulations and measurements of accuracy and speed of performance
› This is a new issue for ACT-R:
• When the reward is given: could already be varied in ACT-R 5
• Magnitude of the reward: could not be varied in ACT-R 5
Thank you for your attention
› Questions?
› More information:
• [email protected]
• www.ai.rug.nl/~cjanssen
• www.cogsci.rpi.edu/cogworks
• Poster session @ CogSci 2008, Thursday, July 24th: “Cognitive Models of Strategy Shifts in Interactive Behavior” (session: “Attention and Implicit Learning”)
References
› Anderson, J. R. (2006). A new utility learning mechanism. Paper presented at the 2006 ACT-R workshop.
› Anderson, J. R. (2007). How can the human mind occur in the physical universe? New York: Oxford University Press.
› Bothell, D. (2005). ACT-R 6 Official Release. Proceedings of the 12th ACT-R Workshop.
› Fu, W. T., & Anderson, J. R. (2004). Extending the computational abilities of the procedural learning mechanism in ACT-R. Proceedings of the 26th annual meeting of the Cognitive Science Society, 416-421.
› Gray, W. D., Schoelles, M. J., & Sims, C. R. (2005). Adapting to the task environment: Explorations in expected value. Cognitive Systems Research, 6(1), 27-40.
› Gray, W. D., Sims, C. R., Fu, W. T., & Schoelles, M. J. (2006). The soft constraints hypothesis: A rational analysis approach to resource allocation for interactive behavior. Psychological Review, 113(3), 461-482.
› Russell, S. J., & Norvig, P. (1995). Artificial intelligence: a modern approach. Upper Saddle River, NJ: Prentice-Hall, Inc.
› Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.