

Journal of Mathematical Psychology 52 (2008) 218–240
www.elsevier.com/locate/jmp

A comparison of models for learning how to dynamically integrate multiple cues in order to forecast continuous criteria

Hugh Kelley a,*, Jerome Busemeyer b

a Copenhagen University, Frederiksberg C, Denmark
b Indiana University, Bloomington, IN 47405, United States

Received 25 April 2006; received in revised form 31 October 2007. Available online 4 June 2008.

Abstract

Is human learning strongly adapted to the specific function learning task to which it is applied or is it a more general characteristic? This study addresses this question by empirically comparing the performance of five dynamic learning models across eleven different continuous criterion function learning tasks. We contrast three variants of rule-based and associative 'neural network' models with two variants of a Bayesian regression forecasting model. The tasks involve: deterministic and stochastic functions, functions with equal and unequal stimuli weights, functions with large and small numbers of stimuli, and linear and nonlinear functions. Evidence of task specificity would be implied if the most descriptive model of learning does systematically vary by task and subject; the alternative independence hypothesis is implied if there are no performance differences. We find two primary results: first, there is evidence of the task independence of learning; and the most valid model is a neural network variant. However, if the criterion variance is large or there are a large number of cues relevant for making predictions, the results favor Bayesian forecasting methods for providing reliable and valid predictions of human responses.
© 2008 Elsevier Inc. All rights reserved.

Keywords: Forecasting; Least squares; Neural network; Model comparison; Function learning; Experiment

1. Introduction

Characterizing how humans learn to forecast a continuously distributed criterion value on the basis of information provided by two or more continuously varying cues has been an objective of interest for experimental social scientists for some time. Klayman (1988) and Lichtenstein and Slovic (1971) describe early experimental research exploring how accurately humans can learn to make such forecasts. Holzworth (1999) provides a comprehensive catalogue of research from this literature.

Numerous important findings have emanated from past research on multiple cue function learning: (1) Additive combination rules are learned faster than non-additive combination rules (Mellers, 1981), but this depends on psychophysical scales (Koh, 1993). (2) Participants can detect shifts in cue relevance during training, but the learning rate is slower after the shift than prior to the shift (Peterson, Hammond, & Summers, 1965; Summers, 1969). (3) Participants, when trained to forecast a target given cues, do not follow the algebraic inverse of the known function when they are alternatively asked to predict cues given the target (Surber, 1987). This suggests some type of asymmetry in the associative learning process. (4) Increasing the validity of one cue, while holding the validity of a second cue constant, reduces the impact of the second (Birnbaum, 1976; Busemeyer, Myung, & McDaniel, 1993; Mellers, 1986). (5) Adding a weakly valid second cue to a task that already has a highly valid first cue reduces performance rather than enhancing performance (Dudycha & Naylor, 1966).

* Corresponding author. E-mail address: [email protected] (H. Kelley).
doi:10.1016/j.jmp.2008.01.009

Until recently, most of this empirical research has been carried out with minimal effort toward building general theory, and with little attempt to make connections with mainstream cognitive psychology. A notable exception is Kalish, Lewandowsky, and Kruschke (2004), which attempts to integrate a number of features identified in earlier function learning studies. Despite this nearly contemporaneous work, many findings lie dormant awaiting a theoretical explanation


Variable definitions

Conventions: Bold lower case refers to a row vector representing one point in time from an across-time matrix, with columns representing predictor variables. Bold upper case refers to the entire across-time and across-predictor variables matrix. Non-bold variables refer to scalars or scalar elements of a vector or matrix.

i: scalar representing the total number of experimental cues (2 or 5).
j: scalar index ranging from 1 to 21 indicating a particular input category for cue 1.
k: scalar index ranging from 1 to 21 indicating a particular input category for cue 2.
bin: scalar representing the number of categories into which the input and output spaces are divided.
m: scalar index ranging from 1 to bin^2 representing all possible stimuli combinations.
n: scalar index ranging from 1 to 21 indicating a particular output category.
trials: number of repeated trials in the experiment (350 or 480).
t: scalar ranging from 1 to trials representing the current trial/period.
win: estimated free parameter scalar representing the number of the most recent trials that are estimated to be in a subject's memory.
δ: estimated free parameter scalar representing a subject's specific optimized learning rate.
ω: estimated free parameter scalar representing a subject's specific optimized decay of learning given new information.
γ: estimated free parameter scalar representing a subject's specific optimized activation gradient.
p: across-time vector of target variable changes.
X: across-time matrix of stimuli realizations.
x_t: vector of stimuli realizations for time t.
Z: across-time matrix of stimuli and polynomial term realizations.
z_t: vector of stimuli and polynomial term realizations for time t.
f: across-time vector of a subject's forecasts.
g: across-time vector of a model's predictions.
a: vector with i elements representing all constant learned stimuli weights.
α: an m × n × t matrix representing the model's estimates of the stimuli weights a for each stimuli combination on each trial.
ρ: m × 1 vector of output or teaching value activations.
χ: m × 1 vector of the bin^2 possible input node activation combinations for 2 stimuli.
R^2: scalar representing the correlation between a subject's forecast and a model's prediction.
ε: uniformly distributed random variable, which represents the stochastic aspect of the learned functions.
O_n: the ALM model's across-time vector of predicted output activations.
O_{n,t}: the ALM model's time t scalar predicted output activation.
A_m(x_t): the ALM model's across-time vector of input activations.
A_{m,t}(x_t): the ALM model's time t scalar input activation of node m.
F_n(ρ_t): the ALM model's across-time activation of the teaching output node n produced by target ρ.

to breathe some life back into this massive body of research. This is an unfortunate state of affairs because knowledge of functional relations between causes and effects is a major component of human conceptual behavior and is especially relevant to other theoretical social sciences such as economics.

The primary goal of this work is to perform a model comparison of classic economics and psychology forecasting models across a number of unique tasks. This will inform us about which model, or more accurately which model features, are most important for providing an accurate description of our subjects. Our secondary goal is to determine the degree of across-subject heterogeneity, based upon estimated parameters. This helps to inform us about whether the typical assumption of homogeneous agents in economic theory is relevant in the function learning context. Additionally, we investigate whether model performance varies across function learning tasks. This will inform us whether certain features of the environment affect the models' ability to describe human subjects. This provides general insights into the task specificity of learning.

In general one can conceptualize a function learning task as the integration of potentially multi-dimensional stimuli x into a prediction regarding a uni-dimensional target outcome z given some stimuli weights a and a functional relationship z = f(x | a). Subjects observe x as stimuli and z as feedback, and their job is to form subjective estimates α of the true objective weights a in order to provide a forecast f. To date the literature has focused on single cue tasks with various linear and nonlinear functional relationships. The current study focuses only on multi-cue settings. Another element of these tasks is the performance of extrapolation, i.e. providing forecasts or model predictions for stimuli outside the previously observed range. All of our tasks involve extrapolation due to the random presentation of stimuli, and therefore criteria, across their entire range. Until a model or subject has observed the extremes, predictions and forecasts are extrapolations.
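To make this trial structure concrete, the following minimal Python sketch (ours, not from the original studies) runs a generic multi-cue function learning task; the `learner` object with `forecast` and `update` methods is an assumed interface that the model sketches later in this paper could plug into.

```python
import numpy as np

def run_task(learner, f, a, n_trials=480, n_cues=2, noise=0.0, seed=0):
    """One pass through a generic multi-cue function learning task.

    On each trial the learner observes a stimulus vector x, produces a
    forecast, and then receives the criterion z = f(x | a), plus optional
    uniform noise, as feedback that it may use to update its subjective
    weight estimates before the next trial.
    """
    rng = np.random.default_rng(seed)
    forecasts, criteria = [], []
    for t in range(n_trials):
        x = rng.uniform(1, 100, size=n_cues)       # stimuli for this trial (range is illustrative)
        forecasts.append(learner.forecast(x))      # prediction made before feedback
        z = f(x, a) + rng.uniform(-noise, noise)   # criterion feedback
        criteria.append(z)
        learner.update(x, z)                       # trial-by-trial learning from the feedback
    return np.array(forecasts), np.array(criteria)
```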

Within this experimental context earlier precedents suggest two different types of psychological theories of human function learning. Koh and Meyer (1991), building on previous theories by Brehmer (1973) and Carroll (1963), proposed a polynomial rule-based learning model. These models assume that the


function f is a polynomial combination of stimuli. For example, employing a second-order polynomial assumes that the subject combines one stimulus and makes forecasts with z = α_1 x_1 + α_2 x_1^2, despite the fact that there may or may not be second- or higher-order relations. Further, this function is assumed when using a polynomial regression technique to recover the full-sample estimates α of the true weights a. In other words, the estimated weights in this work are optimal for the entire set of presented trials, rather than dynamically updated as stimuli are received.

Alternatively, Delosh, Busemeyer, Byun, and McDaniel (1997) proposed an associative model for function learning and extrapolation (EXAM). It is assumed that subjects update association weights linking input stimuli and output node activations as each new error signal arrives. Given stimuli, output nodes are activated, and via a comparison of relative output node activation a continuous forecast is probabilistically provided. Crucially, the associative learning model, when updating subjective association weights, allows weights not directly associated with the currently observed stimuli also to be updated, by a magnitude that decreases with their distance from the observed stimuli value. This last feature has the added benefit of making the model extremely economical with data: one no longer needs to observe a stimulus realization for each possible discrete category before making an accurate prediction, unlike in earlier models without spillover activation. Delosh et al. (1997) compared the rule-based and associative models in terms of their ability to describe human performance using a single cue that was deterministically related to the criterion. In this situation, they found evidence supporting the associative learning model over the polynomial rule-based model. Juslin, Olsson, and Olsson (2003) compared these two classes of models using multiple cues that were probabilistically related to the criterion. Under those conditions, the evidence supports a polynomial rule. This inconsistency suggests that the approaches should be compared across a wider range of tasks. The representative of the associative approach we include in our comparison is based upon the EXAM representation, with two novel extensions.

More recently, a more general and comprehensive theory of function learning and extrapolation has been proposed by Kalish et al. (2004) and Lewandowsky, Kalish, and Ngang (2000). In their POLE model the authors investigate the concept of knowledge partitioning, and whether this form of stimuli integration may reflect an aspect of human function learning. Knowledge partitioning assumes that subjects break apart any set of stimuli or information they receive into separate parcels, which they then individually evaluate in order to choose among a finite set of alternative functions used for making a response. Alternative functions in this setting differ based upon the intercept and slope parameters applied to observed stimuli when making a forecast. As feedback is observed about the accuracy of forecasts relative to the target, subjects update the weights associated with each functional relation and may employ a new function on following trials as its association weight changes. Their model is general enough to handle multiple cue settings, but the empirical analysis reported so far

has been restricted to single cue and response settings. Their model is uniquely qualified to consider nonlinear and non-monotonic functional relationships due to its ability to break apart any problem into piecewise linear segments. The authors find evidence that knowledge partitioning is a key feature of learning and extrapolation to be considered in future work.

There are a number of reasons why this model is not included in the current study despite its relevance for extrapolation. First, although the model has been constructed to handle multiple stimuli, it has so far been compared only to single cue data, and applying the model to our multi-cue data was beyond the purview of this work. Secondly, the majority of our tasks were linear, and the concept of knowledge partitioning in nonlinear tasks would be most relevant for only three of our eleven treatments. This would add a confounding factor, not insurmountable but still important, when comparing linear and nonlinear tasks; in particular, how does one choose the piecewise linear functions? Thirdly, distinguishing extrapolation versus non-extrapolation performance, a key feature of POLE, was not a focus of the current work. Less importantly, at the time this work was conducted POLE was not yet available in a form that could allow application to our tasks.

Interestingly, although we do not directly consider the POLE model, there is an important similarity between the POLE model and the rule-based neural network techniques considered here. In the POLE model there is a finite set of a priori, parameter-specific rules that subjects choose among as they adapt their forecasts. In the current work, we estimate the learning rate as a free parameter which then operates via a neural network rule to adapt all association weights (function parameters) needed for making a forecast. As stimuli and feedback within the experiment are provided these weights dynamically change, and the functional form of the rule the subject is assumed to use to make forecasts also changes. In fact, any of the functions that can be created with variations in the function parameters can be used to make a forecast. In other words, while with the 1-cue POLE model subjects are assumed to have access to a set of 4–12 possible a priori (parameter-specific) functions for forecasting, in the current multi-cue context this set of functions is much larger. In fact, it includes all possible discrete stimuli weight combinations, and these parameters need not be specified a priori. Thus, we do not directly consider POLE. However, support for neural network or associative models can be considered indirect evidence for POLE due to the similarity of the weight updating and prediction processes. Future work may attempt to disentangle the neural network rule-based, associative, and POLE model predictions. The current work compares the Bayesian and neural network rule-based, and associative, predictions.

Forecasting is also a central topic in economics, and several theories of learning have appeared in this literature; see Diebold and Lopez (1996) for a historical survey of economic forecasting techniques. Traditionally, economists have strongly relied on least-squares regression and related Bayesian rule-based techniques when forming expectations or making assumptions about economic decision makers' learning; see


Bray (1985) and Marcet and Sargent (1989a,b,c) for classic theoretical treatments. However, recently economists have begun to employ simple one-layer feed-forward neural network models for forecasting or as representations of economic agents; see Diebold (1998) and Gonzalez (2000) for recent macroeconomics applications, and Kuan and Liu (1995), Refenes, Zapranis, and Francis (1994), and Wong and Selvi (1998) for finance applications. Unfortunately, employing model comparisons to determine which features of these neural network variants are relevant for human expectations formation is difficult with field data due to the complexity and unobserved components present in such economic contexts.

In order to provide a crossover between the psychology and economics literatures, and to eliminate field residual variability, the primary goal of the current work is to provide a model comparison of classic psychology and economics learning models with experimental data. A key feature of the models we consider is that they are time-series or dynamic learning models, in the sense that they make trial-by-trial forecasts and updates to subjective functional weights. We do not focus on asymptotic forecasts of the models, which may not be that different. Instead, we focus on the path to the asymptotic outcome. Thus, any model performance variations we identify describe how subjects' entire process of learning may be better or worse described by a particular model. Of course there are other models in both the economics and psychological traditions we could include in our comparison. However, we focus primarily on classical models in order to provide a broad statement about which of the core features highlighted by these alternative traditions are most relevant to human function learning. Experimental rather than field data was chosen for the model comparisons to provide stronger control over the stimuli and to isolate individual-forecaster features, which is necessary for rigorous model testing. A wide variety of experimental parameters and functional forms were selected for this comparison on the basis of two criteria. The first is that these functions have appeared in previous psychology experiments. We felt that it is important to determine how well each of our models can explain findings characterizing learning across these earlier benchmark studies; see Lichtenstein and Slovic (1971) for a survey. Second, these functions have also appeared in previous economic applications. Although we are interested in general theories of learning, we are also interested in quantifying how valid the models are for predicting human responses in functional relationships observed in this field; see MasCollel, Whinston, and Green (1995) for a survey of economics oriented function forecasting tasks.

The models are compared on the basis of data from 3 experiments that include 11 conditions which vary several factors: the number of cues, ranging from two to five; the validity of the cues, with cues of equal or unequal influence; linear versus nonlinear relations between each cue and the criterion; additive versus non-additive cue combination rules; shifts in cue relationships midway during training; and different amounts of random noise added to the function which forms the criterion. We believe that to begin a comparative investigation of field conditions that are likely to produce quantitatively

different learning dynamics requires us to characterize learning across variations in the functional form and the number of salient stimuli, but also across variations in the parametric assumptions, including the signal-to-noise ratio and function parameters. The uniqueness of these tasks is confirmed by comparing how statistically distinguishable an optimal learner's predictions are for each task.

Two secondary goals of this research are: to quantify the extent of across-participant heterogeneity in terms of the most descriptive model and estimated parameters; and to determine if the most descriptive model of subjects varies across the function task. As in the previous literature, we quantify subject heterogeneity using within-treatment comparisons of subjects' estimated parameters and goodness-of-fit measures relative to the distribution's central tendency. If we reject the hypothesis that the majority of subjects' parameter estimates or goodness-of-fit measures equal the distribution median for a given treatment, we have identified evidence of subject heterogeneity. We assess the effect of task difficulty and across-task generalizability by using across-treatment comparisons of measures of fit. If all tasks present an equally difficult problem, goodness-of-fit distributions across subjects should be statistically indistinguishable across treatments. If we reject the hypothesis that subjects' goodness-of-fit distribution equals that from another treatment, we have evidence of variation in task difficulty; and lower goodness-of-fit may indicate higher difficulty. For our claims regarding generalizability of our results, our basic assumption is that the various learning models capture unique aspects of human learning that may be task specific. Therefore, variations in model performance beyond the variations due to simple subject variability will be due to changes in how learning occurs for a given task. Since our tasks are quite general, there is no specific reason to believe that the model performance variations we observe for these tasks in these data are not generalizable to settings with similar functions and statistical properties. Further, our use of quite heterogeneous experimental data implies our results are at least as generalizable as the previous literature. Our data come from different disciplines (experimental economics versus experimental psychology), different task contexts (financial forecasting versus medical forecasting), three universities, and a number of years. This represents a uniquely broad set of variations in function properties and experimental methodology.

2. Experimental data sets

Common features of all studies include: (1) Subjects provide a continuous forecast f_t of a continuous criterion target p_t, given observed values of a column vector of i > 1 different continuous stimulus cues, x_t = [x_{1t}, x_{2t}, . . . , x_{it}]^T. (2) All tasks provide a repeated trial framework, where subjects forecast the target variable and receive feedback on each of several hundred independent trials. (3) In all cases, the column vector containing k > 1 objective parameters, a = [a_1, a_2, . . . , a_k]^T, which maps stimulus cue values x_t to a target criterion variable p_t, is set by the experimenter and is initially unknown to the participant.


Any knowledge about these parameters must be learned from experience.

The first study involved 130 paid and unpaid undergraduate students from the University of California, Santa Cruz psychology subject pool participating in a 480 trial stock price forecasting task. This study, referred to as the OJ data, used only two cues and varied the relative validity of the cues, the type of feedback provided, payoffs for accuracy, and the amount of error variance added to the criterion, and provided structural breaks in the cue validities midway during training. The second data set, the MFPF data, involved 55 paid undergraduates from the Indiana University psychology subject pool participating in a 480 trial price forecasting task. This study used five cues and varied the validity of the cues and the amount of noise added to the criterion. The third data set, the ROE data, involved 53 unpaid undergraduates from the Purdue University psychology subject pool participating in a 350 trial medical forecasting task. This study varied the combination rule used to map cues to criterion, including additive linear, additive nonlinear, multiplicative linear, and multiplicative nonlinear. These three experiments produced a total of 238 individual sequences of forecasts to which we fit each learning model.

2.1. Orange juice futures price forecasting

Kelley (1998) and Kelley and Friedman (2003, 2004) presented a linear-stochastic repeated trial individual-choice experiment, denoted Orange Juice Futures (OJ) forecasting, to 130 U.C. Santa Cruz psychology subject pool undergraduates during 1996 to 1997. Participation in this experiment partially satisfied an experimental requirement for introductory psychology classes. These studies consider a subject's ability to forecast, f_t, a stochastic asset market price change target, p_t, given i = 2 independent determinants of the price, x_{1,t} and x_{2,t}. By learning to forecast this asset price change, the subjects would be implicitly learning the weights to attach to each of the stimuli cues presented.

Methodology. This was a computerized experiment using a graphics computer program written in C++, run on Power Macintosh 7500/100 computers with color monitors. Subjects in four sound-dampened isolated testing rooms view controlled events on the monitor screen and respond by clicking the mouse on various icons on the display.

Participants were first provided a three page instructional handout, which included a graphical representation and explanation of the screens they would observe. There, subjects are told that p_t refers to the local orange juice futures price change, that x_{1,t} refers to the time t local weather hazard which could potentially destroy all or part of the domestic orange production, and that x_{2,t} refers to the time t competing supply of oranges from Brazil. Participants were asked to forecast the linear-stochastic target price p_t, observing only the two stimuli x_{it} and post-response feedback, for 480 repeated trials. The OJ experiment training function had the following linear-additive-stochastic form:

p_t = a^T x_t + ε_t = a_1 x_{1t} + a_2 x_{2t} + ε_t.  (1)

The vector of objective parameters a is initially unknown, and any knowledge about these weights must be learned from experience. The noise term ε ~ U[−v, v] reflects the unpredictability of asset prices in field markets. Its value ε_t is drawn independently in each trial from the uniform distribution on [−v, v], where v is the maximum noise value and is a treatment variable (approximately 8 in the baseline treatment). The realized target price change p_t in trial t depends on the realized value of x_{1,t} ∈ [1, 100] and its scalar coefficient a_1 (approximately 0.4 in the baseline treatment), and on x_{2,t} ∈ [1, 100] and its scalar coefficient a_2 (approximately −0.4 in the baseline treatment), as well as the noise component. The coefficient signs reflect the economic reality that destruction of domestic crops or increased foreign supply would cause price increases or decreases, respectively.

Participants were asked to forecast the price change for t = 480 consecutive trials, with two scheduled 5 min breaks. The experiment generally lasted 1.5–2.0 h. Their performance feedback included the actual price and a rating calculated as the squared deviation of the price change forecast f_t from the current target price change, p_t.

A total of seven experimental treatment manipulations are provided and are called: Baseline, Paid, No Score, No History, High Noise, Asymmetric, and Structural Break. However, due to their weak effects and similarity to the Baseline treatment, the Paid, No Score and No History treatments are pooled together with the Baseline data. The treatments described after the Baseline case below are ordered in increasing anticipated difficulty for making accurate price forecasts. See Kelley and Friedman (2003) for a discussion of this task difficulty ranking.

Baseline. The first treatment asks participants to forecast the orange juice price change from the unknown function Eq. (1). The unknown coefficient values are a = (0.417, −0.417)^T and v ∈ (−8.3, 8.3), giving a signal-to-noise ratio of 83% stimuli to 17% noise signal. Participants observed the two independent weather and supply cues, feedback on the historical sequence of stimuli and outcomes observed to date, and a score relating the current trial and cumulative forecast accuracy. We call this a symmetric weights treatment since the stimuli have equal absolute influence on the price change. Subjects also have access to summary information describing past stimuli and target realizations and information describing their cumulative forecast accuracy score.
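For illustration, a minimal Python sketch of the baseline OJ data-generating process of Eq. (1) follows; it is our own reconstruction using the baseline parameter values reported above, not the original experiment code.

```python
import numpy as np

def oj_baseline_criterion(n_trials=480, a=(0.417, -0.417), v=8.3, seed=0):
    """Illustrative generator for the OJ baseline task of Eq. (1):
    p_t = a1*x1t + a2*x2t + eps_t, with cues drawn from [1, 100]
    and eps_t ~ Uniform(-v, v)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(1, 100, size=(n_trials, 2))   # weather hazard and competing supply cues
    eps = rng.uniform(-v, v, size=n_trials)       # unpredictable price component
    p = x @ np.asarray(a) + eps                   # realized price change per trial
    return x, p
```

The five-cue MFPF function of Eq. (2) below is the direct analogue, with five positive weights and cues drawn from [−50, 50].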

Paid. This treatment differs from baseline in that participants are paid according to their final scores. Each participant receives a $5.00 show-up fee covering the first 30,000 points of final cumulative score. (Actual final scores always exceeded 30,000, with the top scores over 37,000.) Participants also receive an additional dollar for each 700 points scored above 30,000. The median payment was about $15.00 with top payments about $16.50. Participants are told the payment procedures on arrival.

No Score. This treatment differs from baseline in that participants do not have access to information describing their cumulative forecast accuracy score. Participants still have access to summary information describing past stimuli and target realizations.


No History. This treatment differs from baseline in that participants do not have access to summary information describing past stimuli and target realizations. Participants do have access to information describing their cumulative forecast accuracy score.

High Noise. The unknown coefficient values for this treatment are a = (0.357, −0.357)^T with v ∈ (−14.3, 14.3). The impacts of these changes are an increase in the noise variance and a reduction in the impact of the observed stimuli. This implies a smaller signal-to-noise ratio of 71% signal to 29% noise.

Asymmetry. The unknown coefficient values for this treatment are a = (0.250, −0.583)^T with v ∈ (−8.3, 8.3). This manipulation introduces asymmetry in the absolute validity of the weather and supply cues. The OJ price still evolves as a function of the two stimuli; however, the second stimulus, competing supply, is now a more valid predictor of the future OJ price.

Structural Break. This treatment manipulation introduces a once-and-for-all change in the vector of weights a at the halfway point of the experiment. Participants are informed in the instructions that this may occur. The first set of stimuli and variance weights are obtained from the Baseline case; then at trial t = 241 the weights are changed to those given in the Asymmetry treatment.

2.2. Mutual fund price forecasting

The second data set, collected during 2000 and denoted Mutual Fund Price Forecasting (MFPF), presents a repeated trials linear-stochastic function learning task to 54 paid undergraduates from the Indiana University, Bloomington participant pool. Participants in this study partially fulfilled an experimental requirement for their introductory psychology classes. Participants are asked to repeatedly forecast a mutual fund price change p_t based on i = 5 independent stock price change stimuli, x_{i,t}.

Methodology. This was a computerized experiment where all stimuli and responses were collected by a graphical Borland C++ program. Each participant controlled events by pressing the arrow keys, the spacebar, and number keys on the keyboard. The experiment was run on IBM computers in an MS-DOS environment with individual-participant cubicles, with up to 6 cubicles being simultaneously occupied in the laboratory. The experiment lasted 1.5–2.0 h with one unscheduled break.

In the experiment, participants are first presented with a three page instructional handout, which included a graphical representation and explanation of the screens they would observe. They are asked to forecast a mutual fund price change p_t after observing 5 stimuli, x_{i,t}, for each of 480 repeated trials. These stimuli values are graphically represented as a vertical slide rule with zero price change at the midpoint. See Fig. 5a for a representation of the interface subjects used to observe stimuli and provide their continuous forecast. They are also told that the price change they forecast is influenced by a random component, ε_t, reflecting the unpredictability of field prices. The

stimuli in the mutual fund experiment are combined to provide an outcome with the additive-linear-stochastic function (2):

p_t = a^T x_t + ε_t = Σ_{i=1}^{5} a_i x_{it} + ε_t.  (2)

This function is comparable to the rules used in the earlier OJ experiment, with the exception of the additional number of cues. The realized mutual fund price change p_t at trial t is constructed with the realized values of x_{i,t} ∈ [−50, 50] and their coefficients, summarized in the vector a. The weights are all positive, reflecting the economic reality that an increase (decrease) in the scalar value of the underlying stocks x_{i,t} will increase (decrease) the mutual fund price. The noise term ε_t is drawn independently each trial from the uniform distribution [−v, v], with v representing the maximum noise signal.

A total of three experimental treatment manipulations are provided: Baseline, High Noise, and Asymmetric. They are designed to determine if forecasting behavior is influenced by an interaction between the number of cues a participant must process and these treatment manipulations. The manipulations are ordered in increasing anticipated difficulty, and we consider these treatments to be more difficult than those from the OJ experiment due to the increased number of cues.

Baseline. The first treatment asks subjects to forecast the target price from the unknown function Eq. (2) with v ∈ (−10, 10). The unknown coefficient values are a = (0.18, 0.18, 0.18, 0.18, 0.18)^T. We also call this a symmetric weights treatment since the five stock stimuli have equal absolute influence on the mutual fund price change. For this level of variance, the stimuli represent 90% of the signal with only 10% noise.

High noise. The unknown coefficient values for the second treatment are a = (0.15, 0.15, 0.15, 0.15, 0.15)^T with v ∈ (−20, 20). This variation again reduces the signal-to-noise ratio from the stimuli, to 80% signal and 20% noise.

Asymmetric. The unknown coefficient values for the third treatment are a = (0.05, 0.10, 0.15, 0.3, 0.4)^T with v ∈ (−10, 10). This introduces asymmetry in the cue validity.

2.3. Medical prediction task

The third experiment, referred to as the ROE data, presents a deterministic repeated trial individual-choice function learning experiment to 53 Purdue University psychology subject pool undergraduates. These data were collected during 1996. This study partially fulfilled the participants' experimental requirement for an introductory psychology course. The experiment explored subjects' ability to learn and provide a forecast, f_t, of the change in an unknown outcome, p_t, across time given two cues, x_{1,t} and x_{2,t}.

Methodology. The study was conducted in a laboratory equipped with twelve IBM compatible computers. Participants sat and responded by using the arrow, space, and enter keys. A C++ computer program controlled the presentation of the stimuli and collected participant responses. At the start of the experiment participants were asked to read two pages of


instructions explaining the task. They were asked to learn and forecast the fictional relationship between the levels of two unknown substances and the amount of arousal they cause in an unknown person, using feedback as a guide. The instructions also provided an example of the interface; see Fig. 5b for a representation.

The experiment involved four function/treatment groups into which subjects were randomly assigned. These included an additive linear (AL) condition, an additive nonlinear (ANL) condition, a multiplicative linear (ML) condition, and a multiplicative nonlinear (MNL) condition; see below. These last functions are particularly relevant for a study of economic forecasting methods due to the ubiquitous nature of nonlinear relationships in the economics realm. Also, the forms we investigate represent a broad set of the nonlinear forms relevant to economic forecasting; see MasCollel et al. (1995).

In all four treatments participants provide forecasts for a total of 350 repeated trials. The range of the possible stimulus magnitudes for training was x_{1,t}, x_{2,t} ∈ {6, 8, 10, 12, 14}. Treating each cue as a factor with 5 levels, these factors were crossed to form a factorial design of 25 pairs of cue values for the training. These 25 pairs were randomly sampled for a total of 350 trials. The range of possible time t forecast predictions was f_t ∈ [0, 100] in steps of 1. Once participants completed the training phase of the experiment they were given instructions stating that the task would change and that they would not receive any feedback on the following 'transfer' trials (201 ≤ t ≤ 350). The transfer phase of the experiment consisted of trials where participants were presented novel input values, now including stimuli values for x_{1,t}, x_{2,t} of {2, 4, 16, 18} in addition to the stimuli values presented during training. Thus, the stimuli set now included all possible combinations of stimuli across x_{1,t}, x_{2,t} ∈ {2, 4, 6, 8, 10, 12, 14, 16, 18}. The order of the presentation of the transfer trials was randomized for each participant. Transfer trials proceeded exactly as training trials except that feedback was not provided.

For both training and transfer blocks the correct responses to be learned by subjects, p_t, were calculated with the four functional equations represented by Eqs. (3)–(6). These also represent the four treatments for the experiment. All functions were deterministic functions of x_t, which distinguishes them from the earlier two experiments.

Additive linear. The first treatment presented subjects with a simple deterministic and additive linear function.

p_t = a_1 x_{1,t} + a_2 x_{2,t}.  (3)

There were two equally valued coefficients to be learned, a_1 = a_2 = 2.5.

Additive nonlinear. This treatment presented subjects with a

deterministic function with an additive-nonlinear, i.e. quadratic, structure.

p_t = a_1 x_{1,t}^2 + a_2 x_{2,t}^2.  (4)

Again, there were two equally valued coefficients to be learned, a_1 = a_2 = 0.125.

Multiplicative linear. This treatment presented subjects with a deterministic but now multiplicative-linear structure.

p_t = a_1 x_{1,t} x_{2,t}.  (5)

Due to the multiplicative structure in this treatment there was only one coefficient to be learned, a_1 = 0.25.

Multiplicative nonlinear. This treatment presented subjects with a deterministic and multiplicative nonlinear structure.

p_t = a_1 √x_{1,t} √x_{2,t}.  (6)

Again, due to the multiplicative structure in this treatment there was only one coefficient to be learned, a_1 = 5.0.
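For concreteness, the four ROE combination rules of Eqs. (3)–(6) can be written as a small Python sketch; the coefficient values mirror the treatment descriptions above, while the function itself (and its condition labels) is our illustration rather than the original experiment code.

```python
import numpy as np

def roe_criterion(x1, x2, condition):
    """Deterministic ROE criterion for the four treatments, Eqs. (3)-(6)."""
    rules = {
        "AL":  lambda u, w: 2.5 * u + 2.5 * w,               # additive linear, Eq. (3)
        "ANL": lambda u, w: 0.125 * u**2 + 0.125 * w**2,     # additive nonlinear, Eq. (4)
        "ML":  lambda u, w: 0.25 * u * w,                    # multiplicative linear, Eq. (5)
        "MNL": lambda u, w: 5.0 * np.sqrt(u) * np.sqrt(w),   # multiplicative nonlinear, Eq. (6)
    }
    return rules[condition](x1, x2)

# Example: criterion for the training cue pair (10, 12) under each rule.
for cond in ("AL", "ANL", "ML", "MNL"):
    print(cond, roe_criterion(10, 12, cond))
```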

3. Learning models

Five learning model variants are evaluated, two motivated by the economics literature and three motivated by the cognitive psychology literature. The two motivated by the economics literature, e.g. Marcet and Sargent (1989a,b,c), use standard polynomial rules to make forecasts and standard statistical techniques (i.e., the normal equations) to dynamically obtain least-squares estimates of the functions' a parameters on a trial-by-trial basis. One variant, henceforth called model N-L, uses the normal equations applied to all (a long window) of the past observations when estimating parameters; another variant, henceforth called model N-S, uses the normal equations on only a short or truncated window of recent past observations to estimate the parameters. The first two variants of models motivated by the cognitive literature also use the same rule as models N-L and N-S to make forecasts (a polynomial rule-based regression model). However, they use a delta learning rule, see Gluck and Bower (1988), Rescorla and Wagner (1972) and Stone (1986), to sequentially update the estimates of the parameters a, conditional upon a free learning rate parameter, rather than using standard statistical formulas. In one case, henceforth called model D-C, the estimated learning rate for the delta learning model is assumed constant across trials; in another case, henceforth called model D-D, a decay-of-learning-rate parameter is also estimated, allowing the learning rate to vary. The third model motivated by the cognitive literature is a radial basis type of connectionist learning model originally developed by Busemeyer et al. (1993), which they call the associative learning model or ALM model. This model differs from models D-C and D-D in the form of representation used to make forecasts and in its radial updating of network weights, but it is similar in terms of the use of a delta learning rule to update the parameters a. As will be described below, our ALM model is nearly equivalent to the EXAM model proposed by Delosh et al. (1997). The key differences have to do with the formation of model predictions and with extrapolation. With the normal rules, extrapolation is quite straightforward since the models are estimating subjective weights for a function, and one need simply cross weights with stimuli. Except for the precision of estimates, whether or not the stimuli are novel is unimportant. For the delta rules and ALM, extrapolation is allowed because both the input and output ranges are normalized to (0, 1). When the model calculates a prediction it is a weighted average based


upon the inner product among weights and input activations. The prediction allows extrapolation because we assume that inputs may be related to outputs across this entire normalized range. This is an alternative to EXAM's linear interpolation technique described in Kalish et al. (2004). A weakness is that one must predefine extremes for both the input and output ranges.

Some common notation will be used to describe all of the models. Recall that the cue values presented on each trial are denoted by a column vector x_t, and the target criterion or feedback presented on each trial is denoted p_t. A participant's forecast on that same trial is denoted f_t. Finally, the forecast generated by a model for that trial will be denoted g_t. The discrepancy (p_t − f_t) is the difference between the target criterion and the participant's forecast, which is presumably what the participant used to learn to improve his or her predictions on future trials. The discrepancy (p_t − g_t) is the difference between the target criterion and the model's prediction, which is what the model used to improve its predictions on future trials. Most importantly, (f_t − g_t) is the difference between the participant's forecast and the model's prediction. We are mainly interested in this last discrepancy, which is used to measure goodness-of-fit and to evaluate a model's ability to explain a participant's behavior. In applying these models we are most interested in determining the best representation of subjects' responses, i.e. minimizing this last discrepancy, and determining if the best model varies with functional form.

3.1. Model N-L: Normal equations applied to all past observations

Models N-L and N-S both used polynomial regression models to generate predictions on each trial. Recall that both the Orange Juice price forecasting experiment and the medical prediction experiment presented only two cues. The model used for both of these experiments was a second-order polynomial regression model of the form

g_t = α_1 x_{1t} + α_2 x_{2t} + α_3 x_{1t}^2 + α_4 x_{2t}^2 + α_5 x_{1t} x_{2t} + α_6 x_{1t}^2 x_{2t} + α_7 x_{1t} x_{2t}^2 + α_8 x_{1t}^2 x_{2t}^2.  (7)

Although the orange juice forecasting experiment did not involve any nonlinear or non-additive terms, the subjects did not know this, and we did not wish to impose this a priori.

The mutual fund experiment used 5 cues, which makes the second-order polynomial too complex, and so in this case we simply used the linear additive model:

g_t = α_1 x_{1t} + α_2 x_{2t} + α_3 x_{3t} + α_4 x_{4t} + α_5 x_{5t}.  (8)

The column vector of subjective weights, α = [α_1, . . . , α_k]^T, was estimated using the solution to the normal equations based on the observations ranging from trial 1 to the current trial t, as follows. First we formed a t × k matrix Z_{1:t} with the k predictors forming the columns, and the values of these predictors for t trials forming the rows. In the case of Eq. (7), there are k = 9 columns of predictors, and in the case of Eq. (8), there are k = 5 predictors. The target criterion values from trials 1 to t are collected in the vector p_{1:t}.

The sum of squared prediction errors between the target criterion and the model forecast, Σ(p_t − g_t)^2, for trials 1 to t is minimized by selecting the weights that solve the normal equations (Z_{1:t}^T Z_{1:t}) α_t = Z_{1:t}^T p_{1:t}. The solution to the normal equations was then used to obtain the parameter estimate vector α used to make forecasts for the next trial:

α_t = (Z_{1:t}^T Z_{1:t})^{−1} Z_{1:t}^T p_{1:t}.  (9)

Z_{1:t} indicates that trials 1 to t were used to estimate the parameters used for the model forecast at trial t + 1; that is, g_{t+1} was based on the weights estimated from trial t, i.e. α_t.

Obviously, this is just a standard statistical regression model for making predictions. This would also be considered a Bayesian forecast under diffuse priors, i.e. a combination of priors and likelihood values in order to obtain a posterior probability of an outcome given observed stimuli (Marcet & Sargent, 1989a,b,c). The model N-L has no free parameters. The forecasts from the model are completely determined by the cue values and true criterion values presented by the experimenter. This model assumes perfect memory regarding all past observations and no bias in the way that stimuli are integrated, and thus provides an objective upper bound on learning performance. Although it may be implausible to think humans can perform at a Bayesian level of efficiency, it is interesting to compare the extent to which actual subjects, and the predictions of the other fitted models, deviate from this performance limit.
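As a concrete illustration, the following Python sketch (ours, with an illustrative helper name `poly_features`) generates model N-L predictions trial by trial by re-solving the least-squares problem of Eq. (9) on all observations seen so far. A least-squares solver replaces the explicit matrix inverse for numerical stability; the feature list follows the eight terms written in Eq. (7), while the paper counts k = 9 predictors, so the original implementation may include an additional term such as an intercept.

```python
import numpy as np

def poly_features(x):
    """Second-order polynomial predictors of Eq. (7) for a 2-cue trial."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1*x2,
                     x1**2 * x2, x1 * x2**2, x1**2 * x2**2])

def nl_predictions(X, p):
    """Model N-L: on each trial, forecast with weights estimated from all
    previous trials, then re-estimate the weights including this trial."""
    Z = np.array([poly_features(x) for x in X])
    g = np.zeros(len(p))
    alpha = np.zeros(Z.shape[1])        # no data yet, so the first forecast defaults to 0
    for t in range(len(p)):
        g[t] = Z[t] @ alpha             # forecast using weights from trials 1..t-1
        # lstsq returns the minimum-norm solution when fewer trials than predictors exist
        alpha, *_ = np.linalg.lstsq(Z[: t + 1], p[: t + 1], rcond=None)
    return g
```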

3.2. Model N-S: Normal equations on a recent short window of past observations

The next model is made slightly more psychologically plausible by limiting the memory of the model. The only difference between models N-L and N-S is the number of past observations included in the estimation of the subjective parameters used to forecast on each trial. For this model, only the recently experienced cue values from trials [t, (t − 1), (t − 2), . . . , (t − win)] were included in the predictor matrix, forming a matrix with k columns and win + 1 rows, Z_{(t−win):t}, where win is the window size of the lag extending backward in time. The normal equations were again used to estimate the weights:

α_t = (Z_{(t−win):t}^T Z_{(t−win):t})^{−1} Z_{(t−win):t}^T p_{(t−win):t}.  (10)

For this model, the forecast on the next trial, g_{t+1}, is based on the weights estimated from a recent window of experience; thus the model incorporates time myopia.

This model allows the window size parameter win to be freely estimated so as to minimize the lack of fit between the model predictions and the subject's forecasts, Σ(f_t − g_t)^2. As a result, model N-S has one free model parameter, win, characterizing the degree of recency. Intuitively, the model captures the amount of information that a participant might be able to remember. One might expect that the estimated window size would be smaller for experiments with a larger number of stimuli dimensions if a memory constraint is binding. The win parameter is estimated separately, up to a maximum


window length of 330 trials, for each person's data, to allow for individual differences in memory. As we impose a maximum length, it is possible for the N-L model to outperform the N-S model if early trials are especially important for a subject.
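Model N-S can be sketched as a small variation on the N-L code above, restricting the estimation sample to the most recent win + 1 trials; the exhaustive grid search over win shown here is only one plausible way to fit the window parameter and is not taken from the paper. Z denotes the trial-by-predictor matrix built from Eq. (7) or (8), e.g. with the `poly_features` helper sketched earlier.

```python
import numpy as np

def ns_predictions(Z, p, win):
    """Model N-S: like N-L, but weights are re-estimated on each trial
    using only the most recent win + 1 observations (Eq. (10))."""
    g = np.zeros(len(p))
    alpha = np.zeros(Z.shape[1])
    for t in range(len(p)):
        g[t] = Z[t] @ alpha
        lo = max(0, t - win)                                   # truncated window of recent trials
        alpha, *_ = np.linalg.lstsq(Z[lo: t + 1], p[lo: t + 1], rcond=None)
    return g

def fit_win(Z, p, f, max_win=330):
    """Choose win to minimize the lack of fit to the subject's forecasts f (exhaustive search)."""
    sse = lambda w: np.sum((f - ns_predictions(Z, p, w)) ** 2)
    return min(range(1, max_win + 1), key=sse)
```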

3.3. Model D-C: Delta learning rule with a fixed learning rate

The next two models also use the same rule-based polynomial formulas to make predictions (Eqs. (7) and (8)). The only difference between the next two models and the previous two is the manner in which the subjective parameter estimates used to make predictions from the polynomial, α_t, are updated. Rather than using the normal equations to estimate the weights, a delta learning rule is used instead; see Stone (1986). For the current model, D-C, the learning rate for the delta rule is assumed to be constant.

Define z_t as a column vector containing the values of the predictors computed from the cues on trial t. For example, for the medical prediction task, z_t = [x_{1t}, x_{2t}, x_{1t}^2, x_{2t}^2, x_{1t} x_{2t}, x_{1t}^2 x_{2t}, x_{1t} x_{2t}^2, x_{1t}^2 x_{2t}^2]^T. The delta rule update of subjective weights on trial t is:

α_t = α_{t−1} + δ (p_{t−1} − g_{t−1}) z_{t−1}.  (11)

δ represents the one free model parameter, describing a participant's specific constant learning rate. The term in parentheses captures the model's prediction error signal from the previous trial's scalar prediction and the actual scalar price change. This error signal is multiplied by the adjustment/learning rate δ and the appropriate element in z_{t−1} to obtain the association weight in α_t for the corresponding element of z. These weights again describe the model's estimate of the participant's perceived relation between targets and predictors. Similar to the other models, the forecast on the next trial, g_{t+1}, is computed by inserting the trial t weights from Eq. (11) into the polynomial regression formulas. Note that this produces a unique sequence of trial-by-trial weights compared to N-L and N-S.

A constant learning rate is the most commonly used form of the delta learning model. This model shares some similarities with model N-S. When a constant learning rate is used, recently experienced observations have more impact on the current weight estimates and distant past observations are forgotten. In fact, the impact of an observation decreases exponentially as a function of its age or lag back in time.

During model fitting, we optimize the one free parameter of the model, δ, so that the lack of fit between the model's predictions and the subject's forecasts, Σ(f_t − g_t)^2, is minimized. Importantly, it may be the case that individual subjects adjust to forecast errors, i.e. learn, at different rates. In order to capture these potentially heterogeneous learning rates, the free parameter δ is fit to each participant's responses separately.

3.4. Model D-D: Delta learning rule with a decaying learning rate

It might be the case that subjects' rate of learning decays with experience. The only difference between model D-D and model D-C is that the learning rate decreases with training as follows:

α_t = α_{t−1} + (δ / t^ω)(p_{t−1} − g_{t−1}) z_{t−1}.  (12)

In this equation, the parameter ω controls the rate of decay of the learning rate. This model has two free model parameters, δ and ω, that can be used to minimize the lack of fit between the model's predictions and the participants' forecasts. These parameters were estimated by minimizing the sum of squared errors Σ(f_t − g_t)^2 for each individual, to allow heterogeneity in learning and decay rates. To obtain predictions of participants' forecasts from this model again requires the time t vector of subjective weight estimates, given by Eq. (12), to be combined with the time t predictor vector z_t using Eq. (7) or (8).

Note that the D-D model is similar in some characteristics to the N-L model but also displays important differences. First, both models attempt to reduce the sum of squared error between the model prediction and the observed subject response. Second, both models share the property that as training progresses each new observation has a decreasing effect on the new weight estimates. In fact, convergence theorems which prove that the delta learning rule converges asymptotically to the least-squares estimate, e.g. White (1989), require the learning rate to decrease with training. Importantly, however, the delta rule with both constant and decaying learning rates displays prediction dynamics distinct from those of the normal equations. This is because neither the constant nor the declining adjustment rates of the delta rules match the declining adjustment rate of the normal equations with a growing sample.

3.5. ALM: Associative learning model

The final model considered is an associative connectionist network model extended to allow analysis of continuous response data by Busemeyer et al. (1993). Similar to the D-C and D-D models, the ALM model employs a delta parameter learning rule. However, unlike all of the previous models, the ALM model does not employ a regression rule to compute predictions; instead, it employs a radial basis network representation to generate its forecasts. First, consider the application of the ALM model to the OJ and ROE medical prediction tasks, which involved only two cues.

ALM posits an input layer that processes the two cues, and this input layer is in the form of a two-dimensional grid. The first dimension of the grid represents the first cue by a set of equally spaced values {χ_{1,i}, i = 1, . . . , b}; and the second dimension of the grid represents the second cue by another set of equally spaced values {χ_{2,j}, j = 1, . . . , b}. In particular, we set b = 21, thus dividing the entire range for each cue into 21 equally spaced values. Crossing these two sets forms the grid or matrix containing b^2 input node values. The input nodes are denoted χ_{ij} = (χ_{1,i}, χ_{2,j}) for i = 1, . . . , b and j = 1, . . . , b.

When a particular stimulus is presented, it activates each possible input node according to its similarity to the current cue values. That is, each of the 21^2 nodes may be activated, but the node actually containing the current stimulus combination


is activated the most. Below, A_{ij}(x_t) represents the activation, at time t, of the (i, j)th input node χ_{ij} when the stimulus x_t = (x_{1,t}, x_{2,t}) is presented. The Gaussian activation function, Eq. (13), provides activation of input nodes as a declining function of the distance of a given input node from the bin actually containing the true stimulus value.

A_{ij}(x_t) = exp{−γ [(χ_{1,i} − x_{1,t})^2 + (χ_{2,j} − x_{2,t})^2]}.  (13)

Here, χ_{ij} denotes one of the b^2 possible input node combinations of the 2 stimuli, with i, j = 1 to 21 categories for each stimulus. Further, x_{it} represents the observed time t input stimulus for the ith dimension, and γ is a constant describing the slope of the activation gradient. In this expression γ defines both input and output activation and determines the gradient differential between the maximum and minimum activation levels. Note that for the earlier delta rules only the value equal to the observed stimulus is activated, so rather than a bell there is only a single activation point.

ALM also posits an output layer ρk , which is in the form ofa one-dimensional grid of points. The output layer representsthe criterion target by a set of equally spaced values, {ρk, k =

1, . . . , b}, and once again b = 21 in this application.Finally, ALM postulates a set of association weights,

αi j,k(t), that associates each input node χi j to each outputnode ρk . The activation of output node ρk , which is denotedOk(xt ) given cues, xt , is obtained by multiplying each inputnode activation Ai j (xt ) with the current association weight,αi j,k(t), and summing across input nodes. The scalar result,Ok(xt ) summarizes the model’s time t activation of a particularoutput node ρk . Eq. (14) describes how we obtain the activationfor an output node,

Ok(xt) = Σi Σj αij,k(t) Aij(xt).    (14)

According to ALM, the probability of choosing to predict the value ft = ρk is given by the ratio of strength rule:

Pr[ft = ρk] = Ok(xt) / Σk Ok(xt).    (15)

Note that this model can only output a value from one of the grid points used to represent the criterion target, i.e. it provides a scalar criterion prediction. The above equation produces a probability distribution of possible outputs on each trial. To convert this distribution into a mean, the probability of each output value is multiplied by its corresponding value to produce the expected output value for a given pair of cues:

gt = Σk ρk Pr[ft = ρk].    (16)

The association weights are updated according to a delta learning rule as follows. First, when the target criterion pt is presented, it activates the set of output nodes to produce a feedback pattern of activation based on a Gaussian distribution centered on the correct target value. The feedback target activation of node ρk, denoted Fk(pt), is given by,

Fk(pt) = exp{−γ[ρk − pt]²}.    (17)

The connection weights αij,k are learned each trial by using the feedback output node activation F, the predicted output activation O, and the input node activation A:

αij,k(t) = αij,k(t − 1) + δ[Fk(pt) − Ok(xt)]Aij(xt).    (18)

This model has 2 free parameters that are fit to individual subjects by minimizing Σ(ft − gt)²: the learning rate δ and the activation gradient γ.

This model is structurally equivalent to the EXAM model of Busemeyer et al. (1993) except for the rule used to obtain a model forecast. By using the deterministic weighted-average rule at Eq. (16) rather than the probabilistic choice rule used by EXAM, we are able to eliminate one source of prediction noise. This increases our parameter precision while still allowing the ALM model to extrapolate as well as EXAM.
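Putting Eqs. (13)–(18) together, one ALM training trial for the two-cue case can be sketched as follows; the parameter values, cue scaling, and zero-initialized weights are illustrative assumptions, not the fitted quantities used in our analysis.

```python
import numpy as np

b = 21
gamma, delta = 10.0, 0.2                 # illustrative free parameters (fitted per subject in the paper)
chi = np.linspace(0.0, 1.0, b)           # input node values for each cue dimension
rho = np.linspace(0.0, 1.0, b)           # output node values for the criterion

def alm_trial(alpha, x1, x2, target):
    """Run one ALM trial: forecast from cues (x1, x2), then delta-update alpha given the target."""
    # Eq. (13): Gaussian activation of the input grid.
    A = np.exp(-gamma * ((chi[:, None] - x1) ** 2 + (chi[None, :] - x2) ** 2))
    # Eq. (14): output node activations O_k = sum over i, j of alpha_ijk * A_ij.
    O = np.einsum("ijk,ij->k", alpha, A)
    # Eqs. (15)-(16): ratio-of-strength probabilities and the expected (mean) forecast.
    P = O / O.sum() if O.sum() > 0 else np.full(b, 1.0 / b)
    forecast = rho @ P
    # Eq. (17): Gaussian feedback activation centred on the presented target.
    F = np.exp(-gamma * (rho - target) ** 2)
    # Eq. (18): delta-rule update of the association weights (in place).
    alpha += delta * A[:, :, None] * (F - O)[None, None, :]
    return forecast

alpha = np.zeros((b, b, b))                # association weights alpha_ijk, initialized at zero
print(alm_trial(alpha, 0.40, 0.75, 0.55))  # one illustrative trial with hypothetical cues and target
```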

This model encounters a problem with the mutual fund experiment, which involves 5 cues. With this many cues it is not feasible to form a five-dimensional grid space of input nodes. Thus, as we did with all of the earlier models, we needed to make a simplifying assumption for the 5 cue experiments. In this case, we assumed that each cue was represented by a single dimension of 21 grid points, but we did not attempt to cross the five sets. Instead, the 5 cues produced 5 separate sets of input nodes. Each set of input nodes was associated with the output nodes by a separate set of association weights. The activation of each output node was computed by summing the weighted activations of all input nodes. We used the same output node activation comparison for generating forecasts and the delta rule for updating association weights given feedback. This simplification of course affects the ability of this model to provide predictions; the full implications of this would be a topic for future simulation analysis. However, compressing the multiple stimuli activation dimensions, where each stimulus has an activation bump on one line, into one line with multiple bumps has some obvious implications. In the absence of compression, a model learns about each stimulus individually and updates a unique weight. When stimuli are combined into one dimension, the model learns about the stimuli as a group. Thus, less is learned about each individual stimulus, what is learned is obtained from a noisier signal, and the model implicitly assumes each stimulus has an equal impact or weight. This could mean that asymmetric weights and many cues will be a problem for ALM relative to other models if this simplification is inappropriate. Further, this compression technique is likely to run into problems if the stimuli were not independent; such interdependence would need to be incorporated into the stimuli compression method. Note that in our tasks all stimuli are independent.

3.6. Model and experimental task distinguishability

We next compare the performance of the optimal learner (N-L) across a subset of the experimental tasks. This addresses questions regarding the distinguishability of our experimental tasks for a baseline learner. We then quantify how distinguishable each model's predictions are for similar tasks, which demonstrates the models' within-treatment statistical identifiability. For these simulation exercises we employ an arbitrary set of parameters for the models. In fact, we use the median of the subjects' estimated parameters; however, at this point in our discussion these are simply arbitrary. Note that any statistical technique focusing on a central tendency, such as the Wilcoxon approach used here, under-represents model differences when the models predict a common target but have different early-trials dynamics. Intuitively, such a test asks whether the models make unique predictions at the midpoint of an experiment, which is only part of the relevant question. The results provided immediately below give a general idea of the distinguishability of tasks and models, but they ignore important differences in the models' early sequences of predictions, which we address with more detailed analysis later.
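For concreteness, the kind of pairwise comparison used throughout this subsection can be carried out with a standard rank-sum routine; the sketch below uses simulated placeholder forecast sequences in place of the actual model predictions.

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
# Placeholder forecast sequences standing in for two models' predictions on one task.
forecasts_model_a = rng.normal(0.55, 0.10, size=480)
forecasts_model_b = rng.normal(0.58, 0.10, size=480)

# Wilcoxon rank sum test of the null hypothesis that the two forecast distributions are equal.
statistic, p_value = ranksums(forecasts_model_a, forecasts_model_b)
print(statistic, p_value)
```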

We first compare the optimal learner's forecasts across four key treatments: OJ symmetric weights, OJ asymmetric weights, MFPF symmetric weights, and ROE multiplicative nonlinear function, similar to the cases described in Fig. 4. When comparing the forecasts of the optimal learner across these four tasks, a Wilcoxon rank sum test rejects the null hypothesis that the forecasts are equal at less than the 2% level for all pairwise comparisons. In other words, the optimal learner makes distinguishable predictions for each task at the median point, which implies that these tasks are unique.

Comparing all models’ predictions within treatments tells usabout the distinguishability of the models for each task. For theOJ symmetric weight tasks the rank sum test rejects the nullof equivalent forecasts only for the ALM model compared tothe others. Thus, for the median-trial prediction in this task, theother models are not distinguishable. For the OJ asymmetricweight tasks, the rank sum test rejects the null of equivalentforecasts for all except the long and short window normalequations in pairwise comparisons. For the MFPF symmetricweight tasks the rank sum test reject the null only for theALM model compared to the others. For the ROE nonlinearmultiplicative task, the rank sum test rejects the null for allexcept the long and short window normal equations. Despite theapparent similarity of median-trial predictions in the symmetricweight tasks or for normal equation models, for each task atleast one model’s prediction is distinct at the 1% level.

3.7. Model comparison methodology

The models being compared in this article are, in general, not nested, and they differ in terms of number of parameters. Model N-L has no free model parameters to be estimated from the data; model N-S has one free parameter, the window size win; model D-C has one free parameter, the constant learning rate δ; model D-D has two free model parameters, the learning rate δ and the learning decay rate ω; and finally ALM has 2 free parameters, the learning rate δ and the activation gradient γ.

We chose to perform our model comparison using a measure of fit that is standard for the forecasting literature in economics, the sum of squared prediction errors (which corresponds to the log likelihood under the assumption of normality for the responses). In particular, for each model and for each participant, we searched for parameters that minimized the sum of squared prediction errors between the subject's forecast and the model's prediction:

SSE = Σt=1,T (ft − gt)²    (19)

where T is the total number of trials in an experimental condition. The SSE was converted into an R2 by the linear transformation:

R2 = 1 − (SSE/TSS)    (20)

where TSS = Σt=1,T (ft − m)² and the mean m = Σt=1,T ft/T.

To accommodate the fact that the models differ in terms of number of parameters, we penalize for the number of parameters by also using a Bayesian information criterion index (henceforth BIC). If we assume that ft is normally distributed with constant variance, then the BIC index for a model can be expressed in terms of the R2 as follows. In this case,

G2 = −2(log likelihood) = N · [ln(SSE/N) + ln(2π) + 1].

Using SSE = (1 − R2) · TSS, with TSS the total sum of squared deviations around the mean,

G2 = N · [ln(TSS(1 − R2)/N) + ln(2π) + 1] = N · ln(1 − R2) + k,

where k = N · [ln(TSS/N) + ln(2π) + 1] is a constant that is common across all models being compared, and it can be ignored. Then, for model i,

BIC_i = −[G2_i + np_i · ln(N)] = −[N · ln(1 − R2_i) + k + np_i · ln(N)],

where np_i is the number of parameters in the model; see Wasserman (2000). We chose the model with the larger BIC, i.e. the one that is less negative. Therefore, we choose the more complex model 2 over the simpler model 1 only if,

BIC_2 > BIC_1
→ −[N · ln(1 − R2_2) + k + np_2 · ln(N)] > −[N · ln(1 − R2_1) + k + np_1 · ln(N)]
→ N · ln(1 − R2_1) − N · ln(1 − R2_2) > (np_2 − np_1) · ln(N) = c · ln(N)
→ ln[(1 − R2_1)/(1 − R2_2)] > c · ln(N)/N
→ (1 − R2_1)/(1 − R2_2) > N^(c/N),

where c = (np_2 − np_1) is the difference in the number of parameters for the two models.
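A compact sketch of this fit-and-compare pipeline (illustrative only; the placeholder forecasts below stand in for a subject's responses and two fitted models' predictions) is:

```python
import numpy as np

def r_squared(subject, model):
    """R2 of a model's predictions against a subject's forecasts, Eqs. (19)-(20)."""
    sse = np.sum((subject - model) ** 2)
    tss = np.sum((subject - subject.mean()) ** 2)
    return 1.0 - sse / tss

def bic(r2, n_trials, n_params):
    """BIC index as defined above, with the common constant k dropped; larger (less negative) is preferred."""
    return -(n_trials * np.log(1.0 - r2) + n_params * np.log(n_trials))

# Hypothetical comparison of a 0-parameter and a 2-parameter model for one subject (N = 480 trials).
rng = np.random.default_rng(1)
subject = rng.normal(0.5, 0.10, 480)            # placeholder subject forecasts
model_1 = subject + rng.normal(0.0, 0.08, 480)  # placeholder predictions of the simpler model
model_2 = subject + rng.normal(0.0, 0.06, 480)  # placeholder predictions of the more complex model

bic_1 = bic(r_squared(subject, model_1), 480, 0)
bic_2 = bic(r_squared(subject, model_2), 480, 2)
print(bic_2 > bic_1)   # choose the more complex model only if its BIC is larger
```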

For several alternative approaches to the problem of comparing non-nested models, see the Journal of Mathematical Psychology 2000 special issue on model selection.

4. Results

Figs. 6a and 6b provide a general description of our MFPF and ROE continuous criterion forecast experiment data. The OJ data are described in Kelley and Friedman (2004). In these scatter plots each point is an individual subject's response plotted against the actual criterion value. All subjects and treatments for each experiment are pooled together. Points lying on the solid line are exactly correct forecasts; points lying within the dotted lines are forecasts within 2 standard deviations of the criterion value.

In Fig. 6a, for the MFPF data, we see a wide dispersion of subject forecasts; however, the distribution of forecasts for each target value is centered around the correct value. This can also be confirmed by observing how most of the points track the area between the dotted 2 standard deviation lines. However, there are many points that are statistically different, i.e. outside these lines, which suggests there is a range of heterogeneous behavior to be explained. In Fig. 6b we see similar data for the ROE experiment. One obvious difference is the horizontal gaps representing criterion values not produced by the stimuli combinations in the number of available trials. Again we see a central tendency of forecasts around the statistically-correct-forecast area. However, there are again many heterogeneous predictions to be explained. In both panels we can get a sense of the extrapolation subjects were asked to perform. During the trials of an experiment, each time subjects observe stimuli that require a forecast outside their previous forecast range, which occurs often, they are asked to perform an extrapolation. One can get a simple sense of this by considering the maximal and minimal criterion values in the figures. We see that the subjects are able to extrapolate well, and the central tendency of these extreme forecasts is centered about the correct value as much as that of less extreme points.

¹ Note that these totals may not exactly match the totals for the experiment. Some of our 238 subjects became unresponsive, so no model could provide an accurate fit. Details are reported in Tables 1 and 3.

We present the specific results of our analysis in three subsections. First, we test if the models' abilities to accurately describe subjects' responses vary by task group. To do this we compare the models' estimated R2s and Bayesian information criterion values for each individual subject, looking for the best; and we briefly discuss general empirical features of the models' estimated parameters. Second, we present evidence suggesting subject heterogeneity. Finally, we explore the influence of task difficulty and the overall predictive performance of each model across individual subjects and treatment groups.

4.1. Model comparison

To investigate the task specificity of learning, we perform a model comparison for each model for each subject across the function task groups. Note that, due to minimal treatment effects across the orange juice baseline, paid, no score, and no history treatments, these data are pooled together into the baseline group; see Kelley and Friedman (2003). The empirical question of interest is whether the best external description of subjects' responses systematically varies by task. If the externally observed characteristics of subjects' learning are indeed task specific, and each class of learning model captures different aspects of learning, one should observe the most accurate descriptive model also varying by task. If instead the characteristics of learning are task independent, one model should dominate across all tasks and subjects.

Empirical evidence quantifying the extent of by-task model performance variation is provided in Tables 1–4. Additional information reported includes: (1) the number of free parameters, np, for each model; (2) the sizes of the samples, N, which together with the previous information allows one to infer that the power of our study is high; (3) the magnitude of model prediction performance differentials. Also, (4) the model that provides the best median R2 (BIC) or frequency of maximal R2s (BICs) for a task has its values reported within a box; and (5) median R2s (BICs) for models that were not significantly different at the 10% level from the best performing model are underlined.

The uppermost rows of the tables label the experiment and then the specific task facing a group of subjects. Below these labels, the rows of the table indicate which model is fit and the columns represent the function learning tasks that we present to subjects. Within each cell (one model row × one task column) we report results of fitting the indicated model to the indicated group of subjects. We report two measures of performance, a median R2 or BIC value and frequency measures. The first entry is either the median of the models' R2 or BIC across the individual subjects within a task group. Then, separated by a comma, we report the number of times the indicated model provided the maximal R2 (minimal BIC) for the individual subjects within this group. This latter item is our frequency-based measure of performance. Summing the frequencies vertically in any given column gives the total number of subjects in that treatment task.¹ The number of subjects in each task is indicated by the Total Subjects row, and the totals for the experiments are indicated by the tally without parentheses. Finally, paired Wilcoxon rank sum/median hypothesis tests are conducted on the models with the highest task group median R2s or BICs in order to determine if their performance advantage is statistically significant, relative to other models, at the 10% level. (See Golden (2000) for an alternative method for statistically testing model comparison indices.) These task-best models are indicated with a box around the table entry.

The first thing to observe from the Table 1 orange juice results, left panel, is that the delta rule with exponential decay of the learning parameter generally appears to best predict subjects' forecasts based on both the median and frequency performance criteria. Specifically, for the baseline and structural break groups, the D-D model provided statistically larger median R2s and frequency of maximal R2s. These results are significant at less than the 10% level for all model comparisons within the baseline and structural break groups. Note that for the baseline condition with symmetric and additive cues, the N-L model provides nearly as large a median R2 and garners only a few fewer maximal R2s across subjects. The D-D model also provided the highest median R2 in the asymmetric cues group, but this model's performance improvement is not statistically significant for the D-D:N-S and D-D:ALM comparisons. Interestingly, for the high noise treatments the basic full information least-squares model N-L appeared to provide a higher median R2 and a substantially higher frequency of providing the maximal R2. This result is significant at less than the 10% level for the N-L:N-S and N-L:D-C comparisons, but not for the N-L:D-D and N-L:ALM comparisons, despite the dramatic differences suggested by the frequency measure. Finally, when we pool 'all subjects' together, the results suggest that the D-D model is the best overall description of the OJ experimental subjects based on both the median and frequency performance criteria.


Table 1
Median task group R2 and frequency of maximal model fit across subjects (each entry: median R2, frequency of maximal R2)

Orange juice (trials = 480)
Model (NP) | Baseline^a | Asym. | High noise | Structural break | All subj.
Least squares: N-L full information (NP = 0) | 0.81, 27 | 0.49, 0 | 0.76, 13 | 0.67, 0 | 0.68, 40
Least squares: N-S short memory (NP = 1) | 0.53, 7 | 0.55, 4 | 0.49, 1 | 0.67, 2 | 0.56, 14
Neural network: D-C constant learning rate (NP = 1) | 0.80, 0 | 0.46, 0 | 0.71, 0 | 0.70, 0 | 0.67, 0
Neural network: D-D decaying learning rate (NP = 2) | 0.83, 32 | 0.56, 5 | 0.75, 5 | 0.76, 16 | 0.73, 58
Neural network: ALM extended (NP = 2) | 0.80, 5 | 0.53, 6 | 0.73, 1 | 0.74, 6 | 0.70, 18
Total subjects | (71) | (15) | (20) | (24) | 130

MFPF (trials = 480)
Model (NP) | Baseline^b | Asym.^c | High noise^d | All subj.
Least squares: N-L full information (NP = 0) | 0.56, 6 | 0.37, 5 | 0.68, 7 | 0.54, 18
Least squares: N-S short memory (NP = 1) | 0.53, 1 | 0.36, 1 | 0.64, 2 | 0.51, 4
Neural network: D-C constant learning rate (NP = 1) | 0.53, 1 | 0.48, 0 | 0.65, 0 | 0.55, 1
Neural network: D-D decaying learning rate (NP = 2) | 0.53, 7 | 0.48, 1 | 0.65, 3 | 0.55, 11
Neural network: ALM extended (NP = 2) | 0.55, 5 | 0.44, 9 | 0.64, 0 | 0.54, 14
Total subjects | (20) | (16) | (12) | 48^g

ROE (trials = 350)
Model (NP) | Additive linear | Additive nonlinear^e | Mult. linear^f | Mult. nonlinear | All subj.
Least squares: N-L full information (NP = 0) | 0.72, 6 | 0.46, 2 | 0.52, 4 | 0.55, 3 | 0.56, 15
Least squares: N-S short memory (NP = 1) | 0.66, 0 | 0.17, 0 | 0.10, 0 | 0.44, 0 | 0.34, 0
Neural network: D-C constant learning rate (NP = 1) | 0.76, 3 | 0.49, 6 | 0.49, 6 | 0.56, 3 | 0.58, 18
Neural network: D-D decaying learning rate (NP = 2) | 0.76, 2 | 0.49, 3 | 0.49, 0 | 0.56, 4 | 0.58, 9
Neural network: ALM extended (NP = 2) | 0.65, 1 | 0.14, 1 | 0.12, 0 | 0.48, 3 | 0.35, 5
Total subjects | (12) | (12) | (10) | (13) | 47^h

Note. Table entries include the median R2 computed from the individual subject R2s for each model (row) in a particular treatment group (column) and, following the comma, the frequency with which a model provided the highest R2 for individuals within a given treatment group. Wilcoxon rank sum hypothesis tests comparing R2s are also provided. The boxes indicate the model in a treatment group that significantly outperformed other models at the 10% level. Underlined entries indicate models whose performance was not significantly different from the best performing model for a given treatment group. The models are described in the text. NP is the number of fitted (or 'free') parameters fitted for each subject to minimize the squared prediction error over all trials. For OJ and MFPF the degrees of freedom are DF = 480 − NP; for the ROE data it is 350 − NP. There were 12 or more subjects in each task group, as indicated by the 'Total subjects' row at the bottom of each panel. There are 238 total subjects facing 480 repeated trials in the OJ and MFPF studies, and 350 repeated trials in the ROE study, across 11 treatment conditions.
a This group includes baseline, paid, no score and no history treatments.
b In this MFPF group, 1 subject became non-responsive, and no model was able to provide a superior R2.
c In this MFPF group, 1 subject became non-responsive, and no model was able to provide a superior R2.
d In this MFPF group, 4 subjects became non-responsive, and no model was able to provide a superior R2.
e In this ROE group, 2 subjects became non-responsive, and no model was able to provide a superior R2.
f In this ROE group, 4 subjects became non-responsive, and no model was able to provide a superior R2.
g The total number of subjects for this experiment is 48 + 6 unresponsive subjects, for a total of 54.
h The total number of subjects for this experiment is 47 + 6 unresponsive subjects, for a total of 53.


Table 2
Average R2 win margin (entries: ∆R2)

Orange juice (trials = 480)
Model (NP) | Baseline^a | Asym. | High noise | Structural break | All subj.
Least squares: N-L full information (NP = 0) | 0.11 | na | 0.07 | na | 0.09
Least squares: N-S short memory (NP = 1) | 0.45 | 0.26 | 0.34 | 0.10 | 0.29
Neural network: D-C constant learning rate (NP = 1) | na | na | na | na | na
Neural network: D-D decaying learning rate (NP = 2) | 0.09 | 0.04 | 0.10 | 0.09 | 0.08
Neural network: ALM extended (NP = 2) | 0.09 | 0.13 | 0.10 | 0.09 | 0.10

MFPF (trials = 480)
Model (NP) | Baseline | Asym. | High noise | All subj.
Least squares: N-L full information (NP = 0) | 0.03 | 0.08 | 0.04 | 0.05
Least squares: N-S short memory (NP = 1) | 0.03 | 0.11 | 0.03 | 0.06
Neural network: D-C constant learning rate (NP = 1) | 0.19 | na | na | 0.19
Neural network: D-D decaying learning rate (NP = 2) | 0.06 | 0.06 | 0.02 | 0.04
Neural network: ALM extended (NP = 2) | 0.14 | 0.23 | na | 0.19

ROE (trials = 350)
Model (NP) | Additive linear | Additive nonlinear | Mult. linear | Mult. nonlinear | All subj.
Least squares: N-L full information (NP = 0) | 0.07 | 0.19 | 0.14 | 0.12 | 0.13
Least squares: N-S short memory (NP = 1) | na | na | na | na | na
Neural network: D-C constant learning rate (NP = 1) | 0.06 | 0.16 | 0.18 | 0.12 | 0.13
Neural network: D-D decaying learning rate (NP = 2) | 0.05 | 0.13 | na | na | 0.09
Neural network: ALM extended (NP = 2) | 0.13 | 0.06 | na | na | 0.10

Note. Table entries are the average R2 improvement margin for the model in the far left column if it provided the highest R2 relative to the other four models, i.e., the most descriptive fit for the indicated group of subjects. The models are described in the text. NP is the number of fitted (or 'free') parameters fitted for each subject to minimize the squared prediction error over all trials. For OJ and MFPF the degrees of freedom are DF = 480 − NP; for the ROE data it is 350 − NP. There are 238 total subjects facing 480 repeated trials in the OJ and MFPF studies, and 350 repeated trials in the ROE study.
a This group includes baseline, paid, no score and no history treatments.

Although the N-L model comes in a close second, the D-D winning advantage is significant at the 10% level for all comparisons.

For the Table 1 MFPF data, middle panel, we see a more varied pattern of model prediction dominance. For the median R2 performance measure, the least-squares rule N-L appears to provide the best description of subjects' responses, except for the asymmetric cues task. In the baseline condition N-L provides a significant, at the 10% level, improvement in predictive performance for the N-L:D-C and N-L:D-D comparisons, but not for the N-L:N-S and N-L:ALM comparisons. For the high noise treatment the N-L median R2 is only greater for the N-L:ALM comparison. Alternatively, for the asymmetric cues case, the delta rule with exponentially declining learning rate D-D provides significantly higher median R2s across subjects except for the D-D:ALM comparison. Finally, when we pool 'all subjects' across task groups and recalculate the median R2 for each model, the results again indicate that the neural network model D-D provides significantly better performance for all model comparisons.

However, for the frequency performance measure with the MFPF data, the evidence is more in favor of the least-squares approach. For the baseline and asymmetric cues conditions, the results indicate that the neural network models D-D and ALM have slight majorities respectively. However, for the high noise condition, as in the OJ data, there is stronger evidence in favor of the least-squares model N-L. Finally, when examining the pooled data, the N-L model garners the most maximal subject R2s overall.

Finally, for the ROE data, right panel, there is evidence that the neural network class of models provides the best description of subjects' responses for both the median and frequency measures of performance. For all treatment conditions, the models D-C, followed by D-D, provided better or equal median R2s and frequencies of maximal R2s across subjects, with a few exceptions. For the additive linear condition, the D-C model provided a significant R2 improvement for the D-C:N-S and D-C:ALM comparisons, but not for the D-C:N-L or D-C:D-D comparisons. In the additive nonlinear task, the performance improvement of the D-C model is significant for all comparisons. For the multiplicative nonlinear task, the D-C model provided a significant performance improvement for the D-C:N-L and D-C:N-S comparisons, but not for the D-C:D-D or D-C:ALM comparison. For the multiplicative linear condition, the N-L model provided a significant performance improvement for the N-L:N-S and N-L:ALM comparisons, but not for the N-L:D-C or N-L:D-D comparisons. Across all the ROE subjects the neural network model D-C provided significantly higher across-subject R2s for all model comparisons.


Table 3
Median task group BIC and frequency of maximal model fit across subjects (each entry: median BIC, frequency)

Orange juice (trials = 480)
Model (NP) | Baseline^a | Asym. | High noise | Structural break | All subj.
Least squares: N-L full information (NP = 0) | −3364, 33 | −3804, 1 | −3437, 15 | −3726, 1 | −3583, 50
Least squares: N-S short memory (NP = 1) | −3807, 7 | −3738, 4 | −3748, 1 | −3748, 2 | −3760, 14
Neural network: D-C constant learning rate (NP = 1) | −3419, 0 | −3818, 0 | −3530, 2 | −3658, 0 | −3606, 2
Neural network: D-D decaying learning rate (NP = 2) | −3322, 26 | −3798, 4 | −3471, 2 | −3575, 14 | −3542, 46
Neural network: ALM extended (NP = 2) | −3424, 5 | −3724, 6 | −3517, 0 | −3601, 7 | −3567, 18
Total subjects | (71) | (15) | (20) | (24) | 130

MFPF (trials = 480)
Model (NP) | Baseline | Asym. | High noise^b | All subj.
Least squares: N-L full information (NP = 0) | −3417, 7 | −3590, 6 | −3258, 10 | −3422, 23
Least squares: N-S short memory (NP = 1) | −3410, 1 | −3726, 1 | −3223, 2 | −3453, 4
Neural network: D-C constant learning rate (NP = 1) | −3308, 7 | −3494, 1 | −3215, 3 | −3339, 11
Neural network: D-D decaying learning rate (NP = 2) | −3314, 1 | −3501, 0 | −3221, 0 | −3345, 1
Neural network: ALM extended (NP = 2) | −3377, 5 | −3766, 9 | −3299, 0 | −3480, 14
Total subjects | (21) | (17) | (15) | 53^e

ROE (trials = 350)
Model (NP) | Additive linear | Additive nonlinear^c | Mult. linear^d | Mult. nonlinear | All subj.
Least squares: N-L full information (NP = 0) | −2211, 6 | −2274, 2 | −2183, 7 | −2339, 4 | −2252, 19
Least squares: N-S short memory (NP = 1) | −2259, 0 | −2367, 0 | −2365, 0 | −2430, 0 | −2355, 0
Neural network: D-C constant learning rate (NP = 1) | −2153, 5 | −2253, 9 | −2182, 6 | −2322, 7 | −2228, 27
Neural network: D-D decaying learning rate (NP = 2) | −2158, 0 | −2259, 0 | −2188, 0 | −2328, 0 | −2233, 0
Neural network: ALM extended (NP = 2) | −2261, 1 | −2399, 1 | −2345, 0 | −2382, 2 | −2347, 4
Total subjects | (12) | (12) | (13) | (13) | 50^f

Note. Table entries include the median BIC computed from the individual-subject BICs for each model (row) in a particular treatment group (column) and, following the comma, the frequency with which a model provided the smallest BIC for individuals within a given treatment group. Wilcoxon rank sum hypothesis tests comparing BICs are also provided. The boxes indicate the model in a treatment group that significantly outperformed other models at the 10% level. Underlined entries indicate models whose performance was not significantly different from the best performing model for a given treatment group. The models are described in the text. NP is the number of fitted (or 'free') parameters fitted for each subject to minimize the squared prediction error over all trials. For OJ and MFPF the degrees of freedom are DF = 480 − NP; for the ROE data it is 350 − NP. There were 12 or more subjects in each task group, as indicated by the 'Total subjects' row at the bottom of each panel. There are 238 total subjects facing 480 repeated trials in the OJ and MFPF studies, and 350 repeated trials in the ROE study, across 11 treatment conditions.
a This group includes baseline, paid, no score and no history treatments.
b In this MFPF group 1 subject became non-responsive, and no model was able to provide a superior BIC.
c In this ROE group 2 subjects became non-responsive, and no model was able to provide a superior BIC.
d In this ROE group 1 subject became non-responsive, and no model was able to provide a superior BIC.
e The total number of subjects for this experiment is 53 + 1 unresponsive subject, for a total of 54.
f The total number of subjects for this experiment is 50 + 3 unresponsive subjects, for a total of 53.


Table 4
Average BIC win margin (entries: ∆BIC)

Orange juice (trials = 480)
Model (NP) | Baseline^a | Asym. | High noise | Structural break | All subj.
Least squares: N-L full information (NP = 0) | 206 | 34 | 136 | 106 | 121
Least squares: N-S short memory (NP = 1) | 665 | 295 | 282 | 156 | 350
Neural network: D-C constant learning rate (NP = 1) | na | na | 46 | na | 46
Neural network: D-D decaying learning rate (NP = 2) | 144 | 55 | 147 | 163 | 127
Neural network: ALM extended (NP = 2) | 115 | 75 | na | 119 | 103

MFPF (trials = 480)
Model (NP) | Baseline | Asym. | High noise | All subj.
Least squares: N-L full information (NP = 0) | 52 | 115 | 67 | 78
Least squares: N-S short memory (NP = 1) | 18 | 94 | 18 | 43
Neural network: D-C constant learning rate (NP = 1) | 43 | 93 | 33 | 56
Neural network: D-D decaying learning rate (NP = 2) | 62 | na | na | 62
Neural network: ALM extended (NP = 2) | 118 | 203 | na | 161

ROE (trials = 350)
Model (NP) | Additive linear | Additive nonlinear | Mult. linear | Mult. nonlinear | All subj.
Least squares: N-L full information (NP = 0) | 80 | 103 | 60 | 82 | 81
Least squares: N-S short memory (NP = 1) | na | na | na | na | na
Neural network: D-C constant learning rate (NP = 1) | 95 | 78 | 103 | 76 | 88
Neural network: D-D decaying learning rate (NP = 2) | na | na | na | na | na
Neural network: ALM extended (NP = 2) | 71 | 76 | na | 97 | 81

Note. Table entries are the average BIC improvement margin for the model in the far left column if it provided the smallest BIC relative to the other four models, i.e., the most descriptive fit for the indicated group of subjects. The models are described in the text. NP is the number of fitted (or 'free') parameters fitted for each subject to minimize the squared prediction error over all trials. For OJ and MFPF the degrees of freedom are DF = 480 − NP; for the ROE data it is 350 − NP. There are 238 total subjects facing 480 repeated trials in the OJ and MFPF studies, and 350 repeated trials in the ROE study.
a This group includes baseline, paid, no score and no history treatments.

Interestingly, for the additive linear treatment of this experiment, a deterministic version of the baseline task for the OJ experiment, the neural network models provided a slightly higher median R2; however, as in the previous cases, the least-squares model N-L came in a close second when considering the frequency measure.

An important limitation of our frequency-based measure of model performance is that information about the magnitude of prediction improvements is lost. We address this problem in Table 2 by reporting the by-task group average R2 win margin of a model, given that the model provides the maximal R2 for at least one subject in the group. 'na' refers to a model that did not provide the best description of at least one subject from a group. The results in this table indicate that, across tasks, the average R2 improvement of the best model is moderate, for the most part in the range ∆R2 ∈ (0.02, 0.20), and is typically around 15%. For the few times that the least-squares model with short memory dominates, the prediction improvement is a bit larger, typically within the range ∆R2 ∈ (0.03, 0.45), up to 45%. Alternatively, for the frequency of maximal R2 criteria, the advantage conveyed upon any winning model is much more significant. Generally, the winning model captures 1.5–2 times as many maximal R2 wins across subjects. As a result, this performance measure tends to suggest much larger performance variations.

Tables 3 and 4 report the results for the absolute Bayesian information criterion measure of external descriptive performance; here smaller is better. The implications are similar to those reported for the median R2s, with a few exceptions. In general, this measure gives somewhat more support for the Bayesian, least-squares, model by either providing a higher frequency of minimal BICs or by making the performance improvement of the neural network models indistinguishable from the least-squares models.

Specifically, for the OJ data, the neural network model D-D again appears to be the best description of subjects' actions. For all treatments, the median BIC is smallest for the neural network class of models. However, for the frequency measure, more support is provided for the N-L model. Further, the dominance of the N-L model for the high noise treatment is reinforced. In terms of significance of the BIC differences, for the baseline condition the N-L model is statistically indistinguishable from the D-D model.

The primary difference between the R2 and BIC measures of descriptive performance appears in the MFPF data. There, the neural network models appear to provide statistically significantly better performance by providing smaller median BIC values. Specifically, for both the baseline and high noise conditions, smaller median BICs are provided by D-D. And further, similar to the results from Table 1, when pooling all subjects the D-D model significantly outperforms the least-squares models in terms of BIC medians, but the N-L model dominates based upon frequency.

Finally, for the ROE data, the median BICs are smallest for the neural network models. For both median BIC and frequency measures of performance, the D-C model provides significant improvements in the BIC values, with two exceptions. First, for the additive linear, multiplicative linear and multiplicative nonlinear conditions, the BIC improvement for the D-C:N-L comparison is not significant. Second, for the additive linear and multiplicative linear treatments, although the D-C model provided the smallest median BIC, the N-L model provided the highest frequency of minimal BICs. Finally, considering 'all subjects', the D-C model provides significantly smaller group BICs.

Table 4 reports the by-task group average BIC win margin of a model, given that the model provides the minimal BIC for at least one subject in the group. The results indicate that, across all tasks, the average BIC improvement of a winning model is small and in the range ∆BIC ∈ (40, 140), representing about a 3% improvement per subject. However, for the times that the short memory N-S model dominates, the prediction improvement is again larger, within the range ∆BIC ∈ (156, 665), up to a 20% gain. Again, for the frequency criteria, the winning advantages are somewhat larger.

An additional result we observe is that our median and frequency measures of forecast performance do not always agree about which is the best performing model. From the experiments we see that the maximal median R2 (minimal BIC) and the maximal frequency predictions diverge for several task groups. This effect could be due to the presence of individual-subject heterogeneity. For instance, if there is significant subject heterogeneity, crucial information is lost when pooling subjects' R2s or BIC values to obtain group performance aggregates. Since the frequency measure does not compress this information, the measures may diverge and the frequency approach may be preferred.

4.2. Subject heterogeneity

Empirical evidence about subject heterogeneity is provided by the R2 distributions in Fig. 1. This figure provides all 238 subjects' R2s for each model. It is important to consider the across-subject distributions of goodness-of-fit measures in order to visually observe the degree of across-subject heterogeneity. We also include the R2s that were equal to zero in order to demonstrate the number of cases when the model failed to describe human responses. This is also important to show in order to highlight the generalizability of the model across tasks. Models which often result in goodness-of-fit measures of zero are not likely to be generalizable across people or tasks. Similarly distributed BIC values are not reported for brevity. The medians give an indication of how well a particular learning model describes subjects. And the distributional variability in each panel characterizes heterogeneity in subjects' learning processes. Note that unresponsive subjects resulted in R2s near zero for any experiment; this partially accounts for the frequency spikes at the left of the various panels. Ordinarily, such subjects would be omitted from the analysis; we include them for completeness and since most of our analysis focuses on medians, which attach a smaller weight to these outliers.

For the models with subject specific free parameters, Fig. 1 panels (B)–(E), we see most often a larger percentage of the mass of subjects' R2s lying to the right compared to panel (A). This indicates that models allowing subject specific parameters have statistically superior fits, indicating there are important subject heterogeneities to consider. Further, about 2/3 of subjects have estimated R2s within the 0.7–0.95 range. And, upon further inspection it was observed that the few subjects who had especially low R2s had become unresponsive after many repeated trials. As a result, the R2 aggregates we report in Table 1 represent conservatively low values, since these 'bad subjects' drive the medians toward 0. Similarly, these bad subjects drive the median BIC values down in Table 3. In general, this heterogeneity suggests that our choice of a median-based measure of performance may be more appropriate than the average measure, unless the bad '0' subjects were subjectively dropped from the sample.

Specific results illustrating subject heterogeneity are provided in the Fig. 3 estimated parameter distributions. Generally, these parameters are widely distributed and display multiple modal tendencies. However, two regularities are of interest. First, panels (A)–(C) provide evidence in favor of our prior hypothesis from the description of the N-S model, predicting that the estimated window size for the N-S model should be smaller for experiments with a larger number of stimuli dimensions if a memory constraint is binding. Comparing the distributions of win across the three experiment panels, we see that the distribution for the 5 cue MFPF experiment is shifted to the left and has a smaller group of maximum window lengths. This evidence of short memory should be tempered by the fact that this feature alone did not provide this model a winning predictive advantage. Secondly, the constant delta learning coefficient δ, Eq. (11), has a fairly robust range around 0.1–0.4. This implies that this representation of subjects' learning involves adapting association weights by 10%–40% of the forecast error in a trial.

4.3. The effects of task difficulty

Another important question concerns whether task difficulty affects subjects' responses or the ability of these models to accurately predict performance. Intuition might suggest that as the task becomes more difficult the behavior of participants may become more erratic, producing model performance declines. Alternatively, rather than interpreting performance variation as a difficulty effect, one might see by-task performance variations as information about function forms that make it difficult for these models to predict behavior; perhaps as subjects switch to alternative processes not captured by these models. We compare performance at by-experiment and by-task levels.

First, consider the by-experiment mean (median) pooled R2s for all subjects and each model. In Fig. 2 OJ (top panel)


Fig. 1. Distribution of R-sqrs for all subjects and experiments by learning model: (A) N-L, normal least squares with full information; (B) N-S, least squares with short memory; (C) D-C, neural network with constant learning rate; (D) D-D, neural network with decaying learning rate; (E) ALM, Busemeyer et al. (1993).

there are 130 (subjects) × 5 (models) = 650 observations, for MFPF (middle) there are 55 × 5 = 275 observations, and for ROE (bottom) there are 53 × 5 = 265 observations. The ordering of these R2 means and medians (standard deviations, henceforth SD) can give an indication of the variability of difficulty across (within) each experiment. We observe that the OJ mean (median) R2 is 0.66 (0.73) with an SD of 0.22. For the five cue MFPF experiment the mean (median) R2 is 0.44 (0.54) with an SD of 0.3. Finally, for the ROE experiment the mean (median) R2 is 0.44 (0.49) with an SD of 0.34. The means suggest that overall the OJ experiment was easiest, i.e. had the highest mean R2s, and the remaining experiments were of equal difficulty. The medians suggest the OJ was again the easiest, but that the MFPF experiment was slightly easier than the ROE tasks. The standard deviations describe how much variation in task difficulty, and subject heterogeneity, is present within each experiment. The ROE task had the highest SD, indicating that this experiment presented the most varied set of tasks in terms of difficulty or fitting ability, also indicating the greatest subject heterogeneity, followed by the MFPF and OJ settings.

We next consider if there are systematic by-task variations in R2s across the eleven task groups. As an a priori ranking of task difficulty, and declining R2, we rank the additive linear, but deterministic, task from the ROE experiment as easiest, followed by the OJ and then the MFPF experiments. Finally, the most difficult functions to learn may be the multiplicative and nonlinear tasks from the ROE experiment. Intuitively, processing deterministic cues is easier than processing stochastic ones. And, processing two linearly combined cues is easier than five. Finally, processing additively combined cues is easier than multiplicative or nonlinear relations; see Kelley and Friedman (2003) for additional discussion.

Empirical evidence for this difficulty ranking obtains from a comparison of the by-task and across-model R2 medians in Table 1. The data in this table indicate that the baseline and structural break OJ tasks and the baseline ROE task are the easiest, i.e., have the highest median R2s. These tasks correspond to stochastic or deterministic functions with few, symmetric, and additively combined cues. Next, the high noise OJ treatment and the high noise and baseline MFPF treatments have the next lowest R2s. The ranking of the remaining groups is a bit less clear. Focusing on the median R2 for the best performing model suggests that the asymmetric 2-cue OJ and multiplicative treatments of the ROE experiment are the next most difficult tasks. Finally, the most difficult tasks are the asymmetric validity 5-cue MFPF and additive nonlinear ROE experiments.

Fig. 4 provides additional evidence about task difficulty as demonstrated by the time series of subjects' average squared forecast errors for key groups. Panel (A) demonstrates that the average forecast error (between the across-subjects average forecast and the target) in the simplest OJ task quickly declines toward a consistently low value. Panel (B) shows that the more difficult 2 cue asymmetric weight tasks produce more frequent


Fig. 2. Distribution of R-sqrs by experiment for all subjects and models.

errors. Panel (C) demonstrates that the 5 cue tasks produce much larger errors. Finally, panel (D) provides results for the ROE tasks. In this last, deterministic, experiment forecast errors are smaller by an order of magnitude since one could eventually accurately calculate the correct responses. However, despite these low errors, describing the entire process of linear and nonlinear learning proved relatively difficult with this set of models, as also demonstrated by the left-shifted distribution of R2s in Fig. 2 panel (C).

4.4. The relation between asymptotic prediction error and goodness-of-fit

The learning curves provided in Fig. 4 also demonstrate how important learning dynamics are for our models' estimated goodness-of-fit. Although a model may make small asymptotic prediction errors, large early errors can result in a low goodness-of-fit. To demonstrate this we calculate an R2 between averaged 'subject' forecasts and the target; these data are summarized as the errors in Fig. 4. Here, the R2 describes the amount of subject forecast variance explained by the true target. This is distinct from the R2s reported earlier, which represent the subject forecast variance explained by model predictions. This difference is immaterial for the demonstration below.

Considering Panel (D) of Fig. 4, we see fairly large early deviations between average subject forecasts and the target. For early trials there is an average squared deviation between the 'subject' and the actual target of SSR ≈ 0.003 for each of trials 1–40. Alternatively, there are much smaller errors in later trials, SSR ≈ 5.0e−5 for each of the last 310 trials. At the same time, the variability of the target, measured as the squared deviation of the target minus its sample average, can be shown to be TSS ≈ 0.006 per trial on trials 1–40, and TSS ≈ 0.0003 for the last 310 trials. Despite the small asymptotic prediction errors for this case, the all-trials R2 relating this hypothetical average 'subject' to the target is relatively low. This can be confirmed by simply cumulating the SSR and TSS for the two subsets of trials and recovering the hypothetical 'subject' R2 as 1 − SSR/TSS = 1 − (40 × 0.003 + 310 × 5e−05)/(40 × 0.006 + 310 × 3e−4) ≈ 0.59. As we are interested in the entire sequence of learning, we focus on this cumulative measure even though it essentially discounts small asymptotic errors and magnifies early ones in calculating goodness-of-fit.
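The arithmetic behind this cumulative measure can be reproduced directly from the per-trial values quoted above; a minimal sketch:

```python
# Early trials (1-40) versus the last 310 trials of the ROE task, using the per-trial values quoted above.
sse_early, sse_late = 40 * 0.003, 310 * 5e-5
tss_early, tss_late = 40 * 0.006, 310 * 3e-4

r2_all_trials = 1 - (sse_early + sse_late) / (tss_early + tss_late)
print(round(r2_all_trials, 2))   # ~0.59: the large early errors pull the all-trials R2 well below 1
```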

5. Discussion

We provide a model comparison analysis of alternative theories of function learning. We also quantify the degree of across-subject heterogeneity and assess whether the performance of alternative models of learning differs across tasks varying the form of the learned function. We believe our results can be helpful for informing researchers about how to construct a more general theory of learning by providing insights into key elements of models of expectations formation and the generalizability of these elements.

The results from our study are based on model fits of five variants of popular economic and cognitive learning models to 238 individual participants. The models included two Bayesian and two neural network rule-based models, and one associative learning model. Each of the participants faced one of eleven different function learning tasks. For all experiments, the models and subjects perform extrapolation while learning. Of course, we only consider a subset of possible models and functional relationships; thus one must qualify our results by saying that they are only directly relevant to our models and experimental data. However, this does not preclude these results from being considered suggestive about regularities that may be relevant elsewhere.

In terms of our primary goal of model comparison, the rule-based neural network models are observed to have the highest median participant R2s for 7 of 11 task groups, and the highest individual-subject frequency of maximal R2 in 7 of 11 task groups. This means the simple rule-based network updating predictions provide the best overall description of individual subjects' forecasts in early and late stages of learning. In particular, the advantage of these models stems from their more accurate predictions of the early stage of human learning. In this region, human dynamic information integration appears more similar to neural network updating with a rule, compared to more efficient Bayesian or associative integration. In a broader sense, this result suggests that perhaps the theoretical social sciences might consider network-based representations in addition to the Bayesian baselines when specifying forecasting agents.

Importantly, by using neural networks without hidden layers, the results reported in this study may not be reporting the full potential of these models to capture nonlinear relationships. In fact, any performance dominance demonstrated by the neural network rule-based models is likely to be a conservative estimate. And further, there does not appear to be evidence that the domain frame of the experiment (economic versus medical) or other experimental methodological details (location, number of subjects, length of experiment, etc.) influences the relative model performance rankings.

Our second goal is to compare the performance of each model across tasks to see if particular model elements provide performance advantages for certain functions. We postulate that if the externally observed characteristics of subjects' learning are indeed task specific in terms of the elements summarized by our models, one should observe the most accurate descriptive model varying by task. For the alternative task independence hypothesis, we should observe one best descriptive model providing the most accurate predictions of subjects' forecasts across tasks. Our result above indicates that task independence, or generality of learning, is the case for the limited function domain relevant to socio-economic theory studied here. This contrasts with the results of Kitzis, Kelley, Berg, Massaro, and Friedman (1998), who found task independent advantages for various least-squares models for stochastic categorization tasks. This may reflect the fact that function learning is a qualitatively different task than categorization, Friedman and Massaro (1998). Nevertheless, the least-squares rule-based model was often a second best description of our subjects. Importantly, the median R2 and BIC values reflecting model fits varied significantly across treatment groups. This indicated that this rule-based model adapted to provide the best description of subject responses across statistically distinct tasks.

There are three interesting caveats to these results. In some cases the performance difference between the neural network rule-based and associative models is statistically insignificant, suggesting associative learning may be equal in importance to neural network updating in some cases. These cases included all the asymmetric weight tasks, the few-cue high noise task, and the multiplicative nonlinear task. The second caveat: as in Anderson (1991), there appears to be some limited evidence of the task specificity of learning for two manipulations. In tasks with high noise or many cues the Bayesian rule-based forecasting approach outperformed alternatives in predicting human responses. And thirdly, for tasks with asymmetric weights, the associative approach had an odd advantage. It often provided the best description of the largest group of individual subjects, giving it a frequency advantage over other models. However, for these subjects the measure of fit is low. So if the associative model is compared to others with a central tendency measure it appears inferior, due to the alternatives being a superior description for just enough subjects.

In summary, these model comparison and task independence results indicate that the neural network rule-based models provided the best description of our subjects in simple linear and low stochasticity tasks with few cues, or in tasks with structural breaks in the functional relation. However, with many or highly stochastic cues, or nonlinear relations, elements are missing from this and the other approaches considered here, which advantages the Bayesian rule-based methodology. Finally, subjects facing asymmetric weight functions were better described with the associative approach. One might consider these results suggestive about what might be a good default model of forecasting and learning even outside the function classes considered here, or about what specific functional tasks might lead humans to be better described by an alternative model.

Our final result is extensive evidence of subject heterogeneity. First, the models allowing subject specific parameters to be estimated provide statistically significant improvements in prediction accuracy, with the exception of the short memory parameter for the Bayesian rule-based model. This implies that there are often individual-subject heterogeneities that are important for providing a good external description of learning. Also, the evidence reported in Figs. 3 and 6 indicates that there is a wide distribution for key estimated parameters, measures of fit, and forecasts, further indicating subject heterogeneity with potentially multiple modal tendencies. Interestingly, this multi-modality is consistent with predictions of other researchers, Kalish et al. (2004). The basic implication of these results is that, when constructing a representation of human learning, it is important to be able to allow for subject heterogeneity vis-a-vis free parameters.

Additionally, our evidence summarized in Fig. 2 presents one perspective on quantifying task difficulty. Namely, those tasks with the largest model R2s may be the most simple for humans and the easiest for us as modelers to forecast, while low R2s may represent more difficult tasks for subjects, resulting in random responses that are difficult to capture with models. If true, this indicates that easier, high R2 tasks involve functions with deterministic, additive, or few cues; next, lower R2 tasks include those with few cues and high stochasticity, the baseline and high stochasticity many cue tasks, and the asymmetric few and many cue tasks; finally, the lowest estimated R2 tasks are the multiplicative and nonlinear deterministic tasks. Interestingly, despite the relative difficulty of modeling subjects' actions with multiplicative and nonlinear forms, the asymptotic forecast error of subjects is small, Fig. 4. This suggests that although modeling the early sequence of errors humans make when facing higher-dimensional forms is difficult for these models, asymptotic performance may be good. This evidence is generally consistent with the difficulty discussion summarized in Kalish et al. (2004).

Finally, pertaining to field relevance, we can provide a few very speculative conjectures about how these results may have relevance to particular economic forecasting problems. Considering that economic systems are composed of interacting humans, it is not too much of a leap to suggest that individual-human responses may aggregate up to the system level and be relevant for forecasting macroeconomic variables; see DeLong, Shleifer, Summers, and Waldmann (1990).


Fig. 3. Distribution of optimal window sizes for N-S: (A) OJ data; (B) MFPF data; (C) ROE data. Distribution of learning parameters for delta rule with constant learning D-C: (D) OJ; (E) MFPF; (F) ROE. Distribution of decay of learning parameter for delta rule with decaying learning D-D: (G) OJ; (H) MFPF; (I) ROE.

Fig. 4. Squared forecast error between subjects' forecasts and the criterion value across trials for: (A) OJ symmetric-weight subjects 1–9, average; (B) OJ asymmetric-weight subjects, average; (C) MFPF all subjects, average; (D) ROE all subjects, average.

In particular, one might expect to observe Bayesian rule-based model performance advantages when forecasting variables composed of many drivers, such as economic aggregates like GNP or major stock market indices such as the DJIA, Nikkei, DAX, and FTSE. The Bayesian method may also be preferred when dealing with highly stochastic variables such as exchange rates, or with variables likely described by nonlinear functions, such as input-output relationships for productive firms and, to some extent, stock prices.


Fig. 5a. Representation of MFPF experiment forecasting and decision screen.

Fig. 5b. Representation of ROE experiment forecasting and decision screen.

Fig. 6a. Scatter plot of MFPF subject responses versus target for all subjects, treatments, and trials. The solid line represents the coordinates of correct responses versus the criterion. Dotted lines represent the boundary two standard deviations above or below the correct target value.

Fig. 6b. Scatter plot of ROE subject responses versus target for all subjects, treatments, and trials. The solid line represents the coordinates of correct responses versus the criterion. Dotted lines represent the boundary two standard deviations above or below the correct target value.

For variables that do not have these features, our results suggest that the simple neural network rule-based model will generally provide the most accurate predictions of human responses and, by extension, of the macroeconomic variables influenced by those responses.

Acknowledgments

Research supported in part by NIMH T32 MH19879-09 awarded to the first author and by NIMH R01 MH068346 awarded to the second author. We thank Robert Roe for his generous contribution of the ROE experimental data and for initial programming assistance. Please direct all correspondence to: Hugh Kelley, Copenhagen University, Faculty of Natural and Life Sciences, FOI, Frederiksberg C, Denmark, 1870. We thank Dan Friedman and the participants of the 32nd annual meeting of the Society for Mathematical Psychology at the University of California, Santa Cruz.

References

Anderson, J. (1991). The adaptive nature of human categorization. Psychological Review, 98(3), 409–429.

Birnbaum, M. H. (1976). Intuitive numerical prediction. American Journal of Psychology, 89(3), 417–429.

Bray, M. (1985). Rational expectations, information and asset markets: An introduction. Oxford Economic Papers, 37(2), 161–195.

Brehmer, B. (1973). Effects of cue validity on interpersonal learning of inference tasks with linear and nonlinear cues. American Journal of Psychology, 86(1), 29–48.

Busemeyer, J. R., Myung, I. J., & McDaniel, M. A. (1993). Cue competition effects: Theoretical implications for adaptive network learning models. Psychological Science, 4.

Carroll, J. D. (1963). Functional learning: The learning of continuous functional mappings relating stimulus and response continua. Princeton, NJ: Educational Testing Service.

DeLong, J. B., Shleifer, A., Summers, L., & Waldmann, R. J. (1990). Noise traders and risk in financial markets. Journal of Political Economy, 98(4), 703–738.

Delosh, E., Busemeyer, J. R., Byun, E., & McDaniel, M. A. (1997). Function learning based on experience with input–output pairs by humans and artificial neural networks. In K. Lamberts, & D. Shanks (Eds.), Concepts and categories. Hove, East Sussex, UK: Psychology Press.

Diebold, F. (1998). The past, present, and future of macroeconomic forecasting. Journal of Economic Perspectives, 12(2), 175–192.

Diebold, F., & Lopez, J. (1996). Forecast evaluation and combination. In Handbook of statistics. Amsterdam: North-Holland.

Dudycha, L. W., & Naylor, J. C. (1966). Characteristics of the human inference process in complex choice behavior situations. Organizational Behavior and Human Decision Processes, 1(1), 110–128.

Friedman, D., & Massaro, D. (1998). Understanding variability in binary and continuous choice. Psychonomic Bulletin and Review, 5(3), 370–389.


Golden, R. (2000). Statistical tests for comparing possibly misspecified and non-nested models. Journal of Mathematical Psychology, 44(1), 153–170.

Gonzalez, S. (2000). Neural networks for macroeconomic forecasting: A complementary approach to linear regression models. Economic Studies and Policy Analysis Division, Department of Finance, Canada, Working Paper 2000-07.

Gluck, M., & Bower, G. (1988). From conditioning to category learning: An adaptive network model. Journal of Experimental Psychology: General, 117, 225–244.

Holzworth, J. (1999). Annotated bibliography of cue probability learning studies. Department of Psychology, University of Connecticut.

Juslin, P., Olsson, H., & Olsson, A.-C. (2003). Exemplar effects in categorization and multiple-cue judgment. Journal of Experimental Psychology: General, 132, 133–156.

Kalish, M., Lewandowsky, S., & Kruschke, J. (2004). Population of linear experts: Knowledge partitioning and function learning. Psychological Review, 111(4), 1072–1099.

Kelley, H. (1998). Learning to forecast price in the laboratory and in the field. Unpublished thesis, Economics Department, University of California Santa Cruz.

Kelley, H., & Friedman, D. (2003). Learning to forecast price. Economic Inquiry, 40(4).

Kelley, H., & Friedman, D. (2004). Learning to forecast rationally. Prepared for Charles Plott & Vernon Smith (Eds.), Handbook of experimental economics results.

Kitzis, S., Kelley, H., Berg, E., Massaro, D., & Friedman, D. (1998). Broadening the tests of learning models. Journal of Mathematical Psychology, 42, 327–355.

Klayman, J. (1988). Cue discovery in probabilistic environments: Uncertainty and experimentation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14.

Koh, K., & Meyer, D. E. (1991). Function learning: Induction of continuous stimulus response relations. Journal of Experimental Psychology: Learning, Memory, and Cognition, 17(5), 811–836.

Koh, K. (1993). Induction of combination rules in two-dimensional function learning. Memory and Cognition, 21(5), 573–590.

Kuan, C., & Liu, T. (1995). Forecasting exchange rates using feed forward and recurrent neural networks. Journal of Applied Econometrics, 10(4), 347–364.

Lewandowsky, S., Kalish, M., & Ngang, S. (2000). Simplified learning in complex situations: Knowledge partitioning in function learning. Mimeo, Department of Psychology, University of Western Australia.

Lichtenstein, S., & Slovic, P. (1971). Reversals of preference between bids and choices in gambling decisions. Journal of Experimental Psychology, 89, 46–55.

Marcet, A., & Sargent, T. (1989a). Convergence of least squares learning mechanisms in self referential linear stochastic models. Journal of Economic Theory, 48, 337–368.

Marcet, A., & Sargent, T. (1989b). Convergence of least squares learning in environments with hidden state variables and private information. Journal of Political Economy, 97, 1306–1322.

Marcet, A., & Sargent, T. (1989c). Least squares and the dynamics of hyperinflation. In W. Barnett, J. Geweke, & K. Shell (Eds.), Chaos, complexity, and sunspots. Cambridge University Press.

Mas-Colell, A., Whinston, M., & Green, J. (1995). Microeconomic theory. Oxford University Press.

Mellers, B. (1981). Configurality in multiple cue probability learning. American Journal of Psychology, 93, 429–443.

Mellers, B. (1986). Test of a distributional theory of intuitive numerical prediction. Organizational Behavior and Human Decision Processes, 38, 279–294.

Peterson, C. R., Hammond, K. R., & Summers, D. A. (1965). Multiple probability-learning with shifting weights of cues. American Journal of Psychology, 78, 660–663.

Refenes, A., Zapranis, A., & Francis, G. (1994). Stock performance modeling using neural networks: A comparative study with regression models. Neural Networks, 7(2), 375–388.

Rescorla, R., & Wagner, A. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and non-reinforcement. In A. H. Black, & W. F. Prokasy (Eds.), Classical conditioning: II. Current research and theory (pp. 64–99). New York: Appleton-Century-Crofts.

Stone, C. J. (1986). A non-parametric framework for statistical modeling. In Proceedings of the International Congress of Mathematicians (pp. 1052–1056).

Summers, D. A. (1969). Adaptation to change in multiple probability tasks. American Journal of Psychology, 82(2), 235–240.

Surber, C. (1987). A formal representation of qualitative and quantitative reversible operations. In C. Brainerd, J. Bisanz, & R. Kail (Eds.), Formal methods in developmental psychology (pp. 115–154). New York: Springer-Verlag.

Wasserman, L. (2000). Asymptotic inference for mixture models using data-dependent priors. Journal of the Royal Statistical Society, 62, 159–180.

White, H. (1989). Learning in artificial neural networks: A statistical perspective. Neural Computation, 1, 425–464.

Wong, B., & Selvi, Y. (1998). Neural network applications in finance: A review and analysis of literature. Information and Management, 34, 129–139.