Paying (for) Attention: The Impact of Information Processing Costs on Bayesian Inference∗

Scott Duke Kominers† Xiaosheng Mu‡ Alexander Peysakhovich§

November 29, 2016

Abstract

Human information processing is often modeled as costless Bayesian inference. However, research in psychology shows that attention is a computationally costly and potentially limited resource. We study a Bayesian individual for whom computing posterior beliefs is costly. Such an agent faces a tradeoff between economizing on attention costs and having more accurate beliefs. We show that even small processing costs can lead to significant departures from the standard costless processing model. There exist situations where beliefs can cycle persistently and never converge. In addition, when updating is costly, agents are more sensitive to signals about rare events than to signals about common events. Thus, these individuals can permanently overestimate the likelihood of rare events (e.g., the probability of a plane crash). There is a commonly held assumption in economics that individuals will converge to correct beliefs/optimal behavior given sufficient experience. Our results contribute to a growing literature in psychology, neuroscience, and behavioral economics suggesting that this assumption is both theoretically and empirically fragile.

∗The authors appreciate the helpful comments of Alessandro Bonatti, Roland Fryer, Drew Fudenberg, Martin Nowak, Larry Samuelson, Andrei Shleifer, Jakub Steiner, Juuso Toikka, and seminar audiences at the Massachusetts Institute of Technology. Kominers appreciates the hospitality of Microsoft Research New England, which hosted him during parts of this research, and gratefully acknowledges the support of NSF Grants CCF-1216095 and SES-1459912, the Harvard Milton Fund, an AMS–Simons travel grant, and the Human Capital and Economic Opportunity Working Group sponsored by the Institute for New Economic Thinking.

†Society of Fellows, Department of Economics, Center of Mathematical Sciences and Applications, Center for Research on Computation and Society, and Program for Evolutionary Dynamics, Harvard University; Harvard Business School; and National Bureau of Economic Research.

‡Department of Economics, Harvard University.
§Human Cooperation Lab, Yale University.

1 Introduction

Modeling human information processing as Bayesian is standard in cognitive science, neuroscience, and economics: scientists have modeled visual perception as a form of Bayesian inference (Knill and Richards (1996)); psychologists have argued that babies are “intuitive Bayesians” (Gopnik (2010)); neuroscience has attempted to investigate how the brain would perform Bayesian inferences (Doya (2007)); and economists have used Bayesian models to justify other assumptions on agent behavior (Epstein and Le Breton (1993)).1

The Bayesian actor model assumes not only an ability to update beliefs, but an ability to do so costlessly. In this paper, we examine how Bayesian behavior changes if the costless processing assumption is relaxed. We show that costly information processing can lead to mechanisms such as those posited by “early selection” models in the psychology of attention. We also show that costly information processing has significant consequences for behavior: in some dynamic settings, even small information costs can lead to lack of belief convergence (i.e., permanent belief cycling); in other settings, processing costs can lead agents to converge to incorrect beliefs. Thus, we see that even a small, arguably realistic departure from the standard Bayesian information processing models can lead to significant changes in behavior.

Our setup is simple: We consider an agent who is uncertain about the way the world works. The agent is a Bayesian, and begins with a (prior) probability distribution over possible models of the world. The agent receives “signals” about the true model of the world, and can choose whether to internalize signal information by updating his prior via Bayesian inference. Updating is costly, but internalizing signal information is valuable because it helps the agent solve asset allocation problems that arise each period.2

We show that agents’ optimal strategy can be implemented using an elegant algorithm: Upon receiving a signal, agents perform a Bayesian update in a low-dimensional space to compute a “decision value” that determines whether the signal provides enough useful, new information to be worth internalizing. If the signal’s decision value exceeds the cost of updating, then agents perform the full Bayesian update. In a world where most signals are not useful, this updating mechanism is an efficient solution to the agent’s problem in the sense that it leads to higher payoffs net of processing costs than choosing to always update beliefs.

We consider decisions in both a single period and a dynamic version of our model. We show that when Bayesian updating is costly, agents can exhibit several behaviors that lead to significant departures from the traditional costless Bayesian baseline. First we show that agents can reject a large proportion of signals even if updating costs are very small. Second, we show that when updating is costly, agents are especially attentive to signals suggesting that rare events are more likely than priors suggest; in other words, individuals optimally attend more closely to signals that suggest that a rare event is more common than it seems (e.g., hearing about an airplane crash) than to signals that a rare event is, in fact, very rare (e.g., a government report on the general safety of airplanes). In the dynamic version of our model we show that agents can exhibit permanent belief cycling, that is, even as agents receive infinite information, their beliefs need not converge to any single point. Even more starkly, we show that it is easy to construct examples where agents converge to completely incorrect beliefs with probability 1.

1For example, Bayesian inference is crucial to the common economic modeling assumption that agents have correct expectations regarding how the world functions: a Bayesian exposed to free-flowing information will, in the long run, always learn the true state of the world (see Proposition 3).

2We focus on the asset allocation problem because it is important in many economic applications and because it has two useful technical features: first, asset allocation is continuous; and second, in asset allocation, better informed agents always have ex-ante higher expected payoffs.

Although agents in our model exhibit seemingly irrational behaviors, these behaviors are not a result of “irrationality” on the part of agents. Indeed, agents in our model are always rationally optimizing, given the constraint that Bayesian inference is computationally costly.

There is ample psychological evidence that individuals display seemingly “irrational” patterns of belief updating. However, often these patterns can seem to be at odds with each other. For example, some experiments show that individuals update more conservatively than Bayesian updating would predict (Edwards (1982))—i.e., they are undersensitive to signals they receive. Other research shows that individuals do not pay enough attention to base rates (Bar-Hillel (1980)). Finally, a third line of research shows that individuals appear to overweight rare events and underweight common events (Prelec (1998)). While some of the behaviors just described may seem at odds with each other, our model makes predictions that can rationalize their coexistence: In the presence of updating costs, some kinds of information structures can make agents look conservative, while others can make agents look like they overweight the probability of rare events. Thus costly Bayesian updating provides a principled and unified formulation that can explain why certain biases may exist in some environments and not others.

Our model also explicitly links to the psychology and neuroscience of attention. The optimal strategy for a costly Bayesian updater can be implemented by a system quite similar to those described in the literature on the early selection theory of attention (Pashler and Sutherland (1998)): First, a fast, parallel system analyzes signals the agent receives and figures out which pieces of information are important enough for internalization. Important signals are then passed on to a slower, serial system for further processing—in this case, Bayesian inference. Thus, agents who must pay costs to perform Bayesian updates can exhibit “cocktail party effects” (Bronkhorst (2000)—experiments on perception show that important ambient signals (e.g., hearing one’s name) are readily attended to, while most incoming information is rejected/lost) and “flashbulb memories” (in which certain episodes are recorded much more vividly in memory than others—see Brown and Kulik (1977)).

Finally, our model predicts particular irrationalities in important economic decisions. For example, agents who have little “decision value” from having more accurate beliefs will update less often. This has implications for political decision making—for example, because each individual has near-zero chance of being the marginal voter, there is little incentive to update beliefs about political issues, and we should expect belief traps (i.e., potentially incorrect beliefs that are stable and hard to change) to be quite common.

Alternative constraints on belief processing have been considered in the prior literature. Wilson (2014) characterizes optimal (Markovian) memory protocols for agents that have a finite number of memory modules. In the two-state setting Wilson (2014) considers, an agent optimally associates each memory module with an interval of beliefs and—after observing a signal—transitions to the module whose corresponding interval contains the posterior. Although Wilson (2014) assumes that the agent forgets about the realized history of signals, Wilson (2014) allows the agent to fully introspect on the source of his beliefs by Bayesian updating over all possible signal histories; in our dynamic model, this channel of introspection is shut down. Like us, Wilson (2014) finds that agents should ignore signals that are mildly informative given the prior (see also Compte and Postlewaite (2012)). Consequently, information processing is “non-commutative” and, in particular, first impressions matter. Contrary to the Wilson (2014) model, however, agents facing information processing costs (as in our model) do not succumb to confirmation bias; rather, they regard evidence opposite to the prior as having higher decision value (see the discussion on belief cycling later in the paper).

Steiner and Stewart (2016) study the optimal design of probability perception when there is noisy information at the time of decision making (see also Gossner and Steiner (2016)). Steiner and Stewart (2016) show that the well-known biases of overweighting small probabilities and underweighting large probabilities arise as a second-best solution to correct for the winner’s curse effect. In our framework, agents also face friction in information processing. Yet, rather than holding the friction fixed and analyzing ex-ante perception, we allow agents to repeatedly interact with the friction by choosing which signals to incorporate at a cost and which ones to ignore. Although our prediction that agents pay more attention to rare events bears resemblance to the finding of Steiner and Stewart (2016), in our context this is because rare events entail higher decision values.

Other models assume that agents have information processing constraints either because they must use one of a finite number of potential learning algorithms (Brock and Hommes (1997)) or because the amount of information (in a Shannon sense) that they can process is limited (Sims (2003)). The way our agents sometimes ignore signals is compatible with an attempt to satisfy limited capacity. Note however that any fixed upper bound to information processing becomes non-binding as beliefs approach the extreme, so that a rationally inattentive agent always updates at extreme beliefs. Thus the model of rational inattention cannot produce belief traps of the type we find.3

Schwartzstein (2014) considers a model in which agents are Bayesian, yet selectively attend only to aspects they believe are important to prediction. With a sufficiently strong (and wrong) prior, a selectively attentive agent may “persistently fail to attend to information even if it would greatly improve forecast accuracy.” The bad, self-confirming outcome Schwartzstein (2014) finds largely depends on the fact that the agent internalizes information via a model that is sometimes misspecified. This does not occur in our framework because our agents face processing costs instead of model misspecification.4

Fryer et al. (2015) introduce a special signal (consisting of a pair of opposite signals) that the agent remembers as favoring his current bias and then updates upon; this leads to double updating and confirmatory bias. In our setting, by contrast, perfectly mixed signals are simply ignored because they have no decision value. Benjamin et al. (2016) explore the dynamic implications of updating with base-rate neglect. Like us, Benjamin et al. (2016) find that agents do not learn the true state in the long run. Contrary to our belief-trap result, however, agents with base-rate neglect always assign a non-vanishing weight to the most recent signal, and their beliefs do not converge along any history.

3Gabaix (2014) considers a rational inattention model in which the agent “takes into account only a few (variables) that are important enough to change his decision”; our model embodies the same idea, with signals replacing variables. We also consider dynamic decision problems with inter-temporal tradeoffs.

4Benjamin et al. (2016) also consider information processing under misspecified models.

Entropy formulas related to those we find have also appeared in the literature on costly information acquisition. Cabrales et al. (2013b) rank information structures in terms of their value for an investor with increasing CRRA and ruin-averse utility. The Cabrales et al. (2013b) ranking is quantified in terms of the expected reduction in entropy between prior and posterior. The similarity to our results is due to log utility being prominent (see Lemma 2 of Cabrales et al. (2013b)). However in the Cabrales et al. (2013b) setup, the cost of information is paid ex ante (as an information acquisition cost), whereas the agent in our model sees the signal first and then decides whether or not to pay to update (see also Cabrales et al. (2013a); Bergemann and Bonatti (2015)).5

Our results are particularly important in the context of equilibrium analysis in economics. A common justification for the focus on equilibrium in many economic models is an argument that equilibria arise as a result of a dynamic learning process (Fudenberg and Levine (1996)). That is, experience and exposure to information should lead Bayesian agents to converge to optimal behavior/correct beliefs.6 Our results contribute to a growing literature that suggests that this assumption is both theoretically and empirically fragile and thus unlikely to hold in many real situations (Haruvy and Erev (2002); Fudenberg and Peysakhovich (2014)).

2 Baseline Setup

We consider an agent who must allocate 1 unit of assets across possible outcomes Ω = {ω1, ω2, . . . , ωk}. The true outcome is generated by a distribution, denoted by θ∗—the (true) model of the world. Our agent’s beliefs about the world are based on a set Θ = {θ1, θ2, . . . , θn} of (possible) models of the world, which contains the true model, θ∗; the agent has probabilistic beliefs p ∈ ∆(Θ).7 We assume that the agent does not know the true model θ∗ but starts with prior beliefs p0 ∈ ∆(Θ) over models of the world putting positive weight on θ∗ (i.e., p0(θ∗) > 0).

5Our model is also related to that of Benabou and Tirole (2002), in which there is a natural rate λN of recalling information, but the agent can alter that rate by paying a cost. The theory we present specializes to the case that λN = 0, with the only possible alteration being perfectly incorporating a signal. We find interesting dynamics unexpected from the static setup of Benabou and Tirole (2002). Further papers on belief manipulation include those of Benabou and Tirole (2004), Compte and Postlewaite (2004), and Brunnermeier and Parker (2005). Papers on memory and recall include those of Mullainathan (2002), Gennaioli and Shleifer (2010), and Bordalo et al. (2015).

6We do note that recent literature has allowed for individuals to hold incorrect, “self-confirming” beliefs in the long run about events they do not actually observe (Fudenberg and Levine (1993)). But in a world of free-flowing information, self-confirming equilibria cannot arise.

7We use the standard notation ∆(X) for the set of probability distributions over set X.

When an outcome actually occurs, the agent gets utility given by u(ci) where ci is the amount of assets he allocated to outcome i. We use the asset allocation problem as our example because it is simple, has economic meaning in many contexts, and implies a direct value for information. Asset allocation problems like the one we study have been used as idealizations of many types of decision problems. For economic applications, the set of outcomes could be possible returns on bonds and equities; the agent’s problem is then to pick an optimal allocation of resources in the market. Asset allocation can also model an agent’s choice of an optimal level of insurance or precaution. In these general settings, as well as in our model, an agent who has a more accurate estimate of the true state of the world will always earn a higher payoff in expectation.

Before allocating his assets, the agent receives a signal s drawn from the set S = {s1, s2, . . . , sn}. Given a model of the world θ, the probability of a given signal s is λ(θ)[s].

The final component of our model is our departure from the standard Bayesian agent model: performing Bayesian updating is costly. After seeing a signal s, the agent chooses whether to internalize the signal by updating his prior. Updating requires payment of a (fixed) cost c, and results in the replacement of the prior beliefs p0 with the posterior beliefs p(s). If the agent does not update, then he pays no cost but the signal information is lost.

We consider first the case of a single-period update; this can also be thought of as an agent who makes decisions and updates each period but has a discount rate of δ = 0. After explaining our baseline results, we analyze how they change in a dynamic world with δ > 0.

3 Optimal Behavior under Costly Updating

For a distribution p on models of the world θ ∈ Θ, we let pΩ ∈ ∆(Ω) be the probability distribution on Ω defined by

pΩ[ω] = ∑_{θ∈Θ} p(θ) θ(ω),

that is, the distribution on outcomes implied by p. With this notation, pΩ[ω] is the best estimate of the probability that ω will be the outcome, given beliefs p about the models of the world.
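For instance (a purely illustrative calculation, not an example from the formal analysis): if Θ = {θ1, θ2} with p(θ1) = .7 and p(θ2) = .3, and if θ1(ω1) = .9 while θ2(ω1) = .1, then pΩ[ω1] = .7 · .9 + .3 · .1 = .66.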

Definition 1. The decision value of a signal s, given prior p0, is

EU_{p(s)Ω}( a∗(p(s)Ω) ) − EU_{p(s)Ω}( a∗(p0Ω) ),

where p(s) ∈ ∆(Θ) is the agent’s posterior given s, a∗(p) represents the optimal action under beliefs p, and EU_p(a) represents the expected utility of action a under beliefs p.

For simplicity, from here we assume that the agent has log utility; that is, consumption of k units is worth log(k) utils. This functional form assumption (combined with the problem setup) drives the functional forms of our results, but not the spirit of the results themselves. Indeed, we think a fruitful set of future work, beyond the scope of this paper, is to provide an axiomatic foundation for utility functions and decision problems which do or do not lead to generalized forms of the pathological behaviors that we find.

The assumption of log utility gives a particularly clear connection between our agent’s problem and a well-studied problem in information theory. Indeed, a fundamental question in information theory is determining when two probability distributions on a space are similar or different. This necessitates defining a distance metric and a very commonly used one is Kullback-Leibler (KL) divergence (Kullback and Leibler (1951)), which is defined as follows.

Definition 2. The Kullback-Leibler (KL) divergence of distribution Q from distribution P defined on a common discrete set Ω is defined as

D(P || Q) = ∑_{ω∈Ω} P(ω) log( P(ω) / Q(ω) ).     (1)

While (1) looks complex, the high-level intuition is straightforward: effectively, KL divergence measures the inefficiency of using distribution Q to approximate distribution P. KL divergence is not a metric in the strict sense, as in general it is not symmetric (that is, D(P || Q) ≠ D(Q || P)). Nevertheless, KL divergence has proven to be a very useful tool with deep connections to problems in fields from statistics to signal processing to artificial intelligence.

We now show that there is an elegant characterization of the agent’s optimal updating strategy in terms of KL divergence.

Proposition 1. Given prior p0, signal s and log utility, the agent updates if and only if

D(p(s)Ω || p0Ω) > c,

where p(s) ∈ ∆(Θ) is the agent’s posterior given s, D(p(s)Ω || p0Ω) is the Kullback-Leibler divergence of the prior from the posterior, and c is the processing cost.

Proof. Recall that in the asset allocation problem with log utility, the first-order conditions set

ci = pΩ(ωi).

This means that if the agent uses p0Ω to make the allocation decision, he ends up with a true expected utility of

∑_ω pΩ(ω) log(p0Ω(ω)).     (2)

However, if the agent acts optimally and uses the true posterior to make his allocation decision, he has expected utility

∑_ω pΩ(ω) log(pΩ(ω)).     (3)

Subtracting (2) from (3) yields

∑_ω pΩ(ω) ( log(pΩ(ω)) − log(p0Ω(ω)) ) = ∑_ω pΩ(ω) log( pΩ(ω) / p0Ω(ω) ),

which is exactly the KL divergence of the prior from the posterior (projected to Ω space). The result then follows, as the agent internalizes the signal if and only if the expected return from doing so exceeds the cost c.
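As a concrete illustration, the update rule of Proposition 1 can be computed directly. The following minimal sketch (Python, with illustrative probabilities and a hypothetical cost value, none of which are parameters from the paper) projects the prior and the posterior onto Ω and compares their KL divergence with c:

import math

def project(beliefs, outcome_probs):
    """Project beliefs over models onto the outcome space Omega."""
    n = len(next(iter(outcome_probs.values())))
    return [sum(beliefs[m] * outcome_probs[m][i] for m in beliefs) for i in range(n)]

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def posterior(prior, signal_probs, s):
    """Bayesian update of beliefs over models after observing signal index s."""
    weights = {m: prior[m] * signal_probs[m][s] for m in prior}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}

# Illustrative two-model example (hypothetical numbers).
outcome_probs = {"theta_1": [0.9, 0.1], "theta_2": [0.1, 0.9]}
signal_probs  = {"theta_1": [0.8, 0.2], "theta_2": [0.2, 0.8]}
prior, c = {"theta_1": 0.5, "theta_2": 0.5}, 0.05

post = posterior(prior, signal_probs, 0)                 # agent sees signal s_1
decision_value = kl(project(post, outcome_probs),        # D(p(s)Omega || p0Omega)
                    project(prior, outcome_probs))
print("decision value:", round(decision_value, 4),
      "-> update" if decision_value > c else "-> ignore")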

Proposition 1 tells us the exact conditions under which updating is useful, but computing a decision value itself requires Bayesian updating. How can agents solve this problem? The key here is that the KL divergence only needs to be computed on the projection of the full distribution onto Ω—thus, in some sense the agent only needs to do a “partial” update to check decision values.

We now illustrate how Proposition 1 implies that agents can use a two-step algorithm that is more efficient than full updating so long as Bayesian updating in a larger space is more computationally intensive.

First, we consider (p0, Θ, S, Ω, λ), the complete decision space. Given all of the components of the problem described above, a fully Bayesian decision-maker has a distribution µ ∈ ∆(Θ × S × Ω) that fully describes the joint probability of any model, signal and outcome combination. To update this distribution given a signal s, the decision-maker must make some number of computations, which we denote k1.

As a different strategy, the decision-maker may instead choose to look at the much lower-dimensional distribution ν that is the projection of µ onto the space (S × Ω). This space is of lower dimension than (Θ × S × Ω); thus, updating in (S × Ω), given a signal s, is simpler, having processing cost k2 < k1. A strategy to implement the computation of the threshold suggested by Proposition 1 is for the decision-maker to update ν in (S × Ω) and use that posterior to determine whether he should spend the computational resources required to update µ in the full decision space.

In a world where all signals are highly informative and worth internalizing, “gatekeeper” computation of ν is a waste of resources. However, in a world where only a proportion q < 1 of signals are worthwhile, the decision-maker pays an expected updating cost of k2 + q · k1 from following the two-step strategy and pays a sure cost of k1 if he always updates. Thus, if q is small or if k1 is large relative to k2—e.g., if the set of models is large but the set of outcomes is small—then the gatekeeper strategy conserves on computational cost while still allowing the agent to internalize important information.
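For instance (with purely illustrative numbers): if a full update in ∆(Θ × S × Ω) costs k1 = 100 elementary computations, the gatekeeper update in (S × Ω) costs k2 = 5, and only q = .1 of signals turn out to be worth internalizing, then the two-step strategy costs 5 + .1 · 100 = 15 computations in expectation, against a sure cost of 100 for always updating fully.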

The “gatekeeper” solution is exactly the kind of system that is set up by early-selection theories in the psychology of attention (Pashler and Sutherland (1998)). In the models posited in that literature, agents are endowed with two information processing systems, a fast one that can process information in parallel, but only at a cursory level, and a slower one that can look at information more in depth, but must do so serially. In such models, the parallel system screens incoming stimuli. After screening, signals deemed unimportant are ignored (e.g., during a conversation at a cocktail party, the lower level system filters out other conversations), while those deemed important are passed “upstairs” for further processing (as when someone at a party yells the agent’s name). This division of labor is a mechanism that can exactly solve our agent’s problem: the faster, “gatekeeper” system can compute the decision value and, if need be, push the signal to the slower serial system for internalization.

Of course, one may ask: why not simply always update in the lower dimensional space (S × Ω) and “forget” the existence of (Θ × S × Ω)? Actually, just forming beliefs over the distribution of ω is not sufficient for the agent in a dynamic sense. The issue is that if an agent just maintains beliefs about ω over multiple updates, then this agent would treat signals and outcomes as independent, whereas in fact they are only independent conditional on θ. Consequently, he will learn the wrong distribution in the long run. Nevertheless, forming beliefs over ω is on its own sufficient for a myopic agent to figure out whether to update beliefs over θ.

A second, more psychological (or reduced-form) interpretation of Proposition 1 is that the decision value of a signal is more easily accessible than are the signal’s implications about the respective likelihoods of different models of the world. For example, a news report about an event (e.g., a plane crash) is informative about both the likelihood of the event (planes crashing) and also about deeper parameters of the world (the safety of the airline industry, the effectiveness of government agencies, and so forth), but the former type of information is more easily cognitively accessible.

We now move on to examples showing how agents with costly updating differ in behavior, sometimes greatly, from their costless Bayesian counterparts.

3.1 Example 1: Quantifying Costs

Proposition 1 shows that the addition of updating costs may cause agents to discard some information. We now show, by numerical example, that this effect distorts behavior away from the standard (rational, with perfect updating) baseline model even when processing costs are relatively small.

We consider an example with two possible models of the world, θ1 and θ2, and two possible outcomes, ω1 and ω2. The probability of outcome ω1 under θ1 is .9; the probability of outcome ω1 under θ2 is .1. The agent has a prior that puts weight p0 on θ1 and 1 − p0 on θ2.8 In this setup, the possible signals s can be characterized by an accuracy level

r ≡ λ(θ1)[s] / ( λ(θ1)[s] + λ(θ2)[s] ).

A signal with accuracy level r = 1 is a signal that θ1 is definitely the true model of the world; a signal with accuracy level r = 0 is a signal that θ2 is definitely the true model. We now consider the agent’s decision to accept a signal with accuracy level r as a function of his prior.

Given a signal of accuracy r, we can compute the agent’s value of internalizing the signal. First, we determine the posterior on outcomes implied by the signal:

p(ω1) = [ p0r / (p0r + (1 − p0)(1 − r)) ] (.9) + [ 1 − p0r / (p0r + (1 − p0)(1 − r)) ] (.1).9

8Here and hereafter, when a distribution p is only over two objects, we abuse notation slightly by using the distribution notation p to represent the probability of one of the objects.

9Note that this computation is simple because there are only two possible models of the world. In general, this computation is much more difficult and the notion of signal accuracy is not so simple to define. Nevertheless, as discussed in the preceding section, the decision value is in general relatively easy to compute.

We then compute the agent’s expected loss from using his prior instead of his posterior in his allocation decision. The loss is the Kullback–Leibler divergence between posterior and prior, given by:

p(ω1) log( p(ω1) / p0(ω1) ) + (1 − p(ω1)) log( (1 − p(ω1)) / (1 − p0(ω1)) ).     (4)

Thus, if (4) is bigger than the updating cost c, then the agent should internalize the information; if not, he should discard the signal.

Clearly, there is a threshold updating cost c above which the agent ignores the signal and below which the agent updates. However, it is difficult to interpret the magnitude of this threshold without a unit conversion. We scale c to be a fraction of the total possible expected utility gains between a completely informed agent (who has expected utility of .9 log(.9) + .1 log(.1)) and a completely uninformed agent (who has expected utility of log(.5)). That is, c is written as a fraction of the term

(.9 log(.9) + .1 log(.1)) − log(.5) ≈ .16.

Figure 1 shows how an agent with prior beliefs p0 reacts to signals of accuracy r. Note that even if c is 10% of the informational value of the problem, the agent will, for any prior, reject signals that have accuracy in the interval [1/3, 2/3]. Thus costly updating has a large distortionary effect on behavior, relative to the costless updating baseline.

Figure 1: Combinations of prior, p0, (horizontal axis) and signal accuracy, r, (vertical axis) that lead the agent to accept (blue) or ignore (red) for c = 1/10 (times .16).
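The accept/reject boundary in Figure 1 can be traced numerically. The sketch below (Python; a rough illustration of the example’s computation, using natural logarithms, with c set to 10% of the informational value of the problem) scans a grid of priors p0 and accuracies r and reports whether the decision value (4) exceeds c:

import math

def kl_bern(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def outcome_prob(p_theta1):
    """Probability of omega_1 implied by weight p_theta1 on theta_1."""
    return 0.9 * p_theta1 + 0.1 * (1 - p_theta1)

# c as a fraction (10%) of the gap between fully informed and uninformed utility;
# note that the numeric value of this gap depends on the base of the logarithm.
informational_value = (0.9 * math.log(0.9) + 0.1 * math.log(0.1)) - math.log(0.5)
c = 0.10 * informational_value

for p0 in [0.1, 0.3, 0.5, 0.7, 0.9]:
    row = []
    for r in [0.1, 0.25, 0.4, 0.5, 0.6, 0.75, 0.9]:
        post_theta1 = p0 * r / (p0 * r + (1 - p0) * (1 - r))   # posterior on theta_1
        dv = kl_bern(outcome_prob(post_theta1), outcome_prob(p0))
        row.append("accept" if dv > c else "reject")
    print(p0, row)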

3.2 Example 2: Attention to Rare Events

Individuals tend to overestimate the probability of rare events like plane crashes and terrorist attacks (see, e.g., Barberis (2013)). In the behavioral economics literature, this overestimation has been modeled as accidental distortion of probabilities (Kahneman and Tversky (1979)). Our model, however, suggests a different explanation (not itself incompatible with probability weighting): when updating is costly, agents’ beliefs tend to gravitate towards models of the world in which rare events are more common than they actually are.

To see this, consider an example with agents who live in a big city. The agents face a problem that can result in one of two outcomes Ω = {ρ (robbed), σ (safe)}. The agents have two possible models of the world: Θ = {θs, θd}. We suppose that θs(σ) = .99 and θd(σ) = .8, so that robbery is always rare, but much rarer under θs than under θd. We can interpret θs as a model in which “the city is safe”; θd is a model in which “the city is dangerous.”

We consider two agents who have the same updating cost c. One agent, whom we call Dirk, starts out thinking that the city is dangerous, and the other, whom we call Stephen, starts out thinking that the city is safe. Both agents are equally certain of their (differing) models of the world: Dirk puts prior probability p0 > .5 on θd, while Stephen puts prior probability p0 on θs.

We examine what happens when both agents receive equally strong signals suggesting that their beliefs are wrong. Formally, Stephen receives a signal sd that has accuracy level r < .5 and Dirk receives a signal ss that has the same accuracy level but in the other direction, namely, (1 − r).

Our next result shows that, quite generally, the contrary signal is more valuable to Stephen than to Dirk.

Proposition 2. For a fixed p0 > .5, suppose that r ∈ (0, .5) is such that the posterior belief of Stephen puts probability at least 1 − p0 on the model θs. Then the decision value of the signal sd for Stephen is strictly higher than the decision value of the signal ss for Dirk.

Proposition 2 illustrates that, given costly updating, agents who believe that the city is safe—that is, agents who think that rare, dangerous events are very rare—are very responsive to information that suggests otherwise. Meanwhile, agents who believe that rare events are relatively common are not as responsive to equally strong information suggesting that rare events are, in fact, rare.

Note that Proposition 2 is only true if the signal is not strong enough to completely overturn the prior. To see why r cannot be too small, consider the above setup with p0 = .6 and r = .1, so that Stephen’s posterior puts probability 1/7 on θs. His posterior estimate of the probability of the “safe” outcome is about .827, compared to the prior estimate .914. From Proposition 1, we compute that Stephen’s decision value is about .0166. Similarly we can compute that the decision value for Dirk is .0183, higher than Stephen’s.
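The comparison in Proposition 2, and its failure for very small r, can be checked numerically. The following sketch (Python, natural logarithms, so the absolute magnitudes need not match the decision values quoted above; the choice r = .4 is just an illustration of an accuracy satisfying the proposition’s condition) computes both agents’ decision values:

import math

def kl_bern(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def decision_values(p0, r):
    """Decision values of the contrary signal for Stephen and for Dirk."""
    # Stephen: prior weight p0 on theta_s, signal s_d with accuracy r toward theta_s
    post_s = p0 * r / (p0 * r + (1 - p0) * (1 - r))        # posterior weight on theta_s
    stephen = kl_bern(0.99 * post_s + 0.8 * (1 - post_s),  # posterior P(safe)
                      0.99 * p0 + 0.8 * (1 - p0))          # prior P(safe)
    # Dirk: prior weight p0 on theta_d, signal s_s with accuracy r toward theta_d
    post_d = p0 * r / (p0 * r + (1 - p0) * (1 - r))        # posterior weight on theta_d
    dirk = kl_bern(0.8 * post_d + 0.99 * (1 - post_d),
                   0.8 * p0 + 0.99 * (1 - p0))
    return stephen, dirk

print(decision_values(0.6, 0.4))   # moderate signal: Stephen's value exceeds Dirk's
print(decision_values(0.6, 0.1))   # very strong signal: the comparison reverses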

4 Belief Dynamics

We now explicitly consider the dynamics of our model. Suppose that the true model of the world remains constant throughout. In this setting, the agent starts each period t = 1, 2, . . . with a prior pt−1, receives a signal s drawn from probability distribution λ(θ∗), and chooses whether to internalize s. The agent then faces the allocation decision outlined in our static model, with beliefs pt = pt−1(s) if s is internalized, and with beliefs pt = pt−1 if not. The agent then proceeds to the next period, with new prior pt. We assume that the agent discounts future payoffs at rate δ ∈ [0, 1).10

Note that we allow the agent to consider future information value in deciding whether to internalize signals. In particular, the agent could take into account the fact that he may choose not to internalize signals in the future. This kind of sophistication makes it difficult to characterize exactly the decision value of a signal for a patient agent, unlike in the static/myopic case of Proposition 1. In our formal analysis, we show a bound on the total value of a signal by assuming that the agent can internalize all future signals at no cost. We then compare that bound to the cost of internalizing the given signal and establish conditions under which the latter is larger. Consequently, our results extend unchanged regardless of what the agent believes his future selves will do.11

10Here, δ can be interpreted in two ways: either the agent actually values future payoffs less than current payoffs, and discounts the future exponentially, or the probability that he faces the asset allocation problem in a given period is δ.

As a benchmark, we consider the case in which there is no updating cost. Here, the agent’s behavior is just as in standard Bayesian models.12

Proposition 3. Suppose that c = 0, that is, that the agent is a standard rational Bayesian. Then with enough experience, the agent learns the true model θ∗ with certainty:

pt(θ∗) → 1 as t → ∞,

where the convergence is almost sure.

We now examine the dynamic implications of updating costs. First, we fix an information structure (λ, S) and an updating cost c.

Definition 3. A belief trap is a probability distribution p ∈ ∆(Θ) such that an agent with prior beliefs p will not internalize any signal s ∈ S.

We say that an agent is in a belief trap if his beliefs p are a belief trap. Note that an agent who starts a period in a belief trap will continue to hold his beliefs forever, irrespective of the signals he receives. Proposition 3 shows that when updating is costless, the only belief trap is the trivial one in which the agent knows the true state with certainty. However, we now show that under very general conditions on the information structure, as long as agents are not perfectly patient (δ < 1), strong beliefs about any possible model of the world become belief traps.

Definition 4. An information structure (λ, S) has full support if for any choice of θ, s we have that λ(θ)[s] > 0.

Proposition 4. If c > 0 and the information structure has full support, then for any θ and any δ < 1, there exists ϵ > 0 such that for any Ω, if the agent’s prior beliefs have p0(θ) > 1 − ϵ, then the agent will reject any signal he receives, so that

pt(θ) > 1 − ϵ, for all t.

When δ = 0, so that our agent is myopic, the characterization in Proposition 1 applies, allowing us to show that when ϵ is sufficiently small, no signal generates a decision value exceeding the cost c.13

When δ > 0, Proposition 1 no longer holds because the updating decision in a given period has repercussions for the future. In the absence of an analytic characterization, we instead give an upper bound on potential future benefits of updating and show it is smaller than c. The method of proof is to choose ϵ much smaller than in the case δ = 0, so that any notable utility gain only happens at least T periods into the future. For large T, these benefits are also negligible because of discounting.

11That said, unlike in the model of Wilson (2014), we do not allow the agent to reflect on how he reached his current beliefs and perform delayed updating on signals that (he infers) must have been ignored in the past.

12To make the model nontrivial, we assume that the signal structure is informative, that is, for any θ, θ′ there exists s such that λ(θ)[s] ≠ λ(θ′)[s].

13We use the full-support condition to bound the difference between posterior and prior distributions.

Proposition 4 shows that belief traps are stable sets for belief dynamics under costly updating. However, Proposition 4 does not show that belief traps are attractors. Our next result shows that when updating is costly, beliefs may not even converge—belief dynamics (and thus behavior) can exhibit permanent cycling with probability 1.

We first give some intuition for our results. To simplify the analysis, we restrict to the case of two possible outcomes Ω = {0, 1}, two models Θ = {a, b} (with a being the true model) and two signals, S = {A, B}. To fully specify the decision problem, we need to state the conditional probabilities of outcomes given models and of signals given models. The tables below show an economically and computationally simple example:

OUTCOMES/STATES     a      b
0                   .8     .2
1                   .2     .8

SIGNALS/STATES      a      b
A                   .85    .15
B                   .15    .85

We now consider what happens when our agents face the problem dynamically. First, we restrict to myopic agents with δ = 0. Here, the conditions from Proposition 1 are sufficient to tell us whether an agent will update or not, given a signal. Figure 2 shows the decision value of accepting each of the signals conditional on a current prior. The value of accepting signal A falls quickly as the agent comes to believe that a is very likely; thus, there is a point at which the agent stops accepting signal A. For example, if the agent has a decision cost of .07, then the agent will not accept signal A if he believes state a has probability at least .59. Even if the agent were to believe that the state a is .59 likely and he received (and accepted) signal A, he would still only have a posterior probability estimate of ∼ .891. However, at that point he would no longer accept signal A but he would accept signal B. The same argument applies symmetrically for signal B. Thus, there is no way for the agent to get into the region where he stops accepting all signals; hence, the agent will exhibit persistent belief cycling (and so his beliefs will have no definable limit as t → ∞).

Figure 2: Sufficient conditions for cycling behavior by costly updating agents (the decision value of accepting signal A or signal B, as a function of the prior probability of a).
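To see the cycling mechanism in action, the dynamics can be simulated directly. The sketch below (Python, natural logarithms, a myopic δ = 0 agent, and the parameters of the tables above with cost c = .07; the horizon length and random seed are arbitrary choices) tracks the belief on model a when a is the true model:

import math, random

def kl_bern(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def outcome_prob(p_a):                      # P(outcome 0) implied by belief p_a on model a
    return 0.8 * p_a + 0.2 * (1 - p_a)

def posterior(p_a, sig):                    # Bayesian posterior on model a after signal A or B
    la, lb = (0.85, 0.15) if sig == "A" else (0.15, 0.85)
    return p_a * la / (p_a * la + (1 - p_a) * lb)

c, p_a = 0.07, 0.5                          # updating cost and initial prior on model a
random.seed(0)
beliefs = []
for t in range(200):
    sig = "A" if random.random() < 0.85 else "B"     # signals drawn from lambda(a), the true model
    post = posterior(p_a, sig)
    if kl_bern(outcome_prob(post), outcome_prob(p_a)) > c:
        p_a = post                                   # internalize only if decision value exceeds c
    beliefs.append(round(p_a, 3))
print("last ten beliefs:", beliefs[-10:])
print("range visited:", min(beliefs), "to", max(beliefs))   # beliefs oscillate and never settle near 1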

This example shows that we can construct decision problems that lead to belief cycling for δ = 0. The next proposition shows that information structures which lead to cycling beliefs can exist for any level of δ and updating cost.

Proposition 5. Fix δ ∈ [0, 1) and c > 0. There exists a decision problem (Θ, Ω) and an accompanying information structure (λ, S) such that there exists a non-empty open set B with the following property: if p0 ∈ B, then

prob( lim_{t→∞} pt(θ∗) is undefined ) = 1.

We note that in the proof of our proposition, we do not explicitly derive the optimal policy. Rather, we bound the utility loss for actions from above. We do not attempt to completely derive the optimal policy for fully sophisticated agents (e.g., using a Bellman equation) because (1) doing so is very difficult and (2) we view full sophistication as an abstract benchmark, not a description of actual behavior. We argue that if even maximally sophisticated agents can exhibit belief cycling (or, as we show below, converge to completely incorrect beliefs), then we should also expect similar behaviors in more boundedly rational agents.

So far we have shown that it is possible for agents to get stuck in belief traps and also for agents to cycle forever. Our final result shows that we can construct decision problems such that no matter how low the agent’s updating cost is, his beliefs will eventually converge to a point far away from the true model of the world unless his prior beliefs are very close to the truth. In contrast with the conclusion of Proposition 3, this result exhibits a surprising discontinuity when the updating cost moves away from zero.

Proposition 6. Fix δ ∈ [0, 1), p0 < 1 and η > 0. There exists a decision problem (Θ, Ω) and an accompanying information structure (λ, S) such that an agent with prior p0(θ∗) = p0, discount factor δ and any small updating cost c > 0 exhibits the following:

prob( lim_{t→∞} pt(θ∗) < η ) = 1.

We sketch the intuition behind the proof here. Consider an information structure with two states and two signals as follows:

SIGNALS/STATES      a        b
A                   1 − ϵ    1 − λϵ
B                   ϵ        λϵ

Suppose that ϵ is very small and λ is very large. In this case, signal A is very weakly informative that the state is a; hence, its decision value is almost 0, and for a fixed c it can be shown that the agent will simply never accept signal A. At the same time, λ is large, so that signal B is very informative that the state is, in fact, b. Thus, even if the true state is a, the agent will accept only signal B (unless he begins at a prior where he accepts no signals). In a dynamic model, this means the agent will converge to believing b very strongly. Note, however, that the proposition asserts an information structure that applies to every c. In the limit as c → 0, the preceding argument fails because the agent must accept signal A at some priors. Fortunately, we can still show that signal B is accepted whenever signal A is. That will again imply that the agent converges to the wrong belief b in the long run under costly updating.
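A simulation along these lines illustrates the mechanism. The sketch below (Python, natural logarithms, a myopic agent) instantiates the signal structure above with the illustrative values ϵ = .01 and λϵ = .5, borrows the outcome structure from the cycling example, and sets c = .01; none of these numbers come from the formal proof. With the true state equal to a, the belief on a nonetheless ends up trapped far from 1:

import math, random

def kl_bern(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def outcome_prob(p_a):                      # P(outcome 0) implied by belief p_a on state a
    return 0.8 * p_a + 0.2 * (1 - p_a)

eps, lam_eps = 0.01, 0.5                    # P(B | a) = eps, P(B | b) = lam * eps
c, p_a = 0.01, 0.6
random.seed(1)
for t in range(2000):
    sig = "B" if random.random() < eps else "A"          # true state is a
    la = eps if sig == "B" else 1 - eps
    lb = lam_eps if sig == "B" else 1 - lam_eps
    post = p_a * la / (p_a * la + (1 - p_a) * lb)
    if kl_bern(outcome_prob(post), outcome_prob(p_a)) > c:
        p_a = post
print("final belief on the true state a:", round(p_a, 3))   # low: trapped near the wrong model b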

5 Conclusion

We have shown that the addition of processing costs to Bayesian inference leads rational agents to adopt strategies of “pre-screening” signals before fully internalizing them. This causes individuals to ignore some information, and can have distortionary effects on behavior even when costs of processing are small. Additionally, costly processing produces a form of probability overweighting, leading individuals to react more strongly to information suggesting that unexpected events are more common than to equivalently strong information suggesting that common events are rare. Finally, individuals for whom processing is costly can sometimes converge to having arbitrarily strong, but wrong, beliefs even in a world of free-flowing information.

There is a large set of psychological findings on biases in human probability judgment. However, the observed biases can often appear to be somewhat contradictory—e.g., some strands of research show that individuals are conservative (i.e., do not update enough) while others show that individuals can update too much (e.g., ignore base rates or otherwise overweight the likelihood of a rare event). Our model can help to reconcile the range of findings on beliefs because when Bayesian processing is costly there are situations where individuals can be maximally conservative (as when caught in a belief trap), but in other situations they can overweight rare events.

As a general point, we have shown that small changes in the assumptions of the standard Bayesian model give rise to behaviors which are, in both the short run and the long run, large departures from classical Bayesian predictions. Given that standard Bayesian-based learning models are, at best, approximations of human decision-making, such lack of robustness may mean that many standard intuitions gleaned from Bayesian-based models are incorrect. Having a better understanding of the kinds of biases human learning actually exhibits (e.g., Bar-Hillel (1980), Barberis (2013), Peysakhovich and Karmarkar (2016), Peysakhovich and Rand (2015), Fudenberg and Peysakhovich (2014)) as well as formal theories that explain the cognitive mechanisms which underlie those biases (e.g., Bordalo et al. (2015), Fryer et al. (2015), Mullainathan (2002)) is an important component of building a more robust and applicable model of human information processing.

References

Bar-Hillel, M. (1980). The base-rate fallacy in probability judgments. Acta Psychologica 44 (3), 211–233.

Barberis, N. (2013). The psychology of tail events: Progress and challenges. American Economic Review 103 (3), 611–616.

Benabou, R. and J. Tirole (2002). Self-confidence and personal motivation. Quarterly Journal of Economics 117 (3), 871–915.

Benabou, R. and J. Tirole (2004). Willpower and personal rules. Journal of Political Economy 112 (4), 848–886.

Benjamin, D. J., A. Bodoh-Creed, and M. Rabin (2016). Base-rate neglect: Model and applications. Working Paper.

Benjamin, D. J., M. Rabin, and C. Raymond (2016). A model of non-belief in the law of large numbers. Journal of the European Economic Association 14 (2), 515–544.

Bergemann, D. and A. Bonatti (2015). Selling cookies. American Economic Journal: Microeconomics 7 (3), 259–294.

Bordalo, P., N. Gennaioli, and A. Shleifer (2015). Memory, attention and choice. Working Paper.

Brock, W. A. and C. H. Hommes (1997). A rational route to randomness. Econometrica: Journal of the Econometric Society, 1059–1095.

Bronkhorst, A. W. (2000). The cocktail party phenomenon: A review of research on speech intelligibility in multiple-talker conditions. Acta Acustica united with Acustica 86 (1), 117–128.

Brown, R. and J. Kulik (1977). Flashbulb memories. Cognition 5 (1), 73–99.

Brunnermeier, M. K. and J. A. Parker (2005). Optimal expectations. American Economic Review 95 (4), 1092–1118.

Cabrales, A., O. Gossner, and R. Serrano (2013a). The appeal of information transactions. Working Paper.

Cabrales, A., O. Gossner, and R. Serrano (2013b). Entropy and the value of information for investors. American Economic Review 103 (1), 360–377.

Compte, O. and A. Postlewaite (2004). Confidence-enhanced performance. American Economic Review 94 (5), 1536–1557.

Compte, O. and A. Postlewaite (2012). Belief formation. Working Paper.

Cover, T. M. and J. A. Thomas (2012). Elements of Information Theory. John Wiley & Sons.

Doya, K. (2007). Bayesian Brain: Probabilistic Approaches to Neural Coding. MIT Press.

Edwards, W. (1982). Conservatism in human information processing. In D. Kahneman, P. Slovic, and A. Tversky (Eds.), Judgment under Uncertainty: Heuristics and Biases, pp. 359–369. Cambridge University Press.

Epstein, L. G. and M. Le Breton (1993). Dynamically consistent beliefs must be Bayesian. Journal of Economic Theory 61 (1), 1–22.

Fryer, R. G., P. Harms, and M. O. Jackson (2015). Updating beliefs with ambiguous evidence: Implications for polarization. Working Paper.

Fudenberg, D. and D. K. Levine (1993). Self-confirming equilibrium. Econometrica 61 (3), 523–545.

Fudenberg, D. and D. K. Levine (1996). The Theory of Learning in Games. MIT Press.

Fudenberg, D. and A. Peysakhovich (2014). Recency, records, and recaps: The effect of feedback on behavior in a simple decision problem. In Proceedings of the 15th ACM Conference on Economics and Computation, pp. 971–986.

Gabaix, X. (2014). A sparsity-based model of bounded rationality. Quarterly Journal of Economics 129 (4), 1661–1710.

Gennaioli, N. and A. Shleifer (2010). What comes to mind. Quarterly Journal of Economics 125 (4), 1399–1433.

Gopnik, A. (2010). How babies think. Scientific American 303 (1), 76–81.

Gossner, O. and J. Steiner (2016). Optimal illusion of control and related perception biases. Working Paper.

Haruvy, E. and I. Erev (2002). On the application and interpretation of learning models. In Experimental Business Research, pp. 285–300. Springer.

Kahneman, D. and A. Tversky (1979). Prospect theory: An analysis of decision under risk. Econometrica 47 (2), 263–292.

Knill, D. C. and W. Richards (1996). Perception as Bayesian Inference. Cambridge University Press.

Kullback, S. and R. A. Leibler (1951). On information and sufficiency. Annals of Mathematical Statistics 22 (1), 79–86.

Mullainathan, S. (2002). A memory-based model of bounded rationality. Quarterly Journal of Economics 117 (3), 735–774.

Pashler, H. E. and S. Sutherland (1998). The Psychology of Attention. MIT Press.

Peysakhovich, A. and U. R. Karmarkar (2016). Asymmetric effects of favorable and unfavorable information on decision making under ambiguity. Management Science 62 (8), 2163–2178.

Peysakhovich, A. and D. G. Rand (2015). Habits of virtue: Creating norms of cooperation and defection in the laboratory. Management Science 62 (3), 631–647.

Prelec, D. (1998). The probability weighting function. Econometrica 66 (3), 497–528.

Rudin, W. (1964). Principles of Mathematical Analysis. McGraw-Hill.

Schwartzstein, J. (2014). Selective attention and learning. Journal of the European Economic Association 12 (6), 1423–1452.

Sims, C. A. (2003). Implications of rational inattention. Journal of Monetary Economics 50 (3), 665–690.

Steiner, J. and C. Stewart (2016). Perceiving prospects properly. American Economic Review 106 (7), 1601–1631.

Wilson, A. (2014). Bounded memory and biases in information processing. Econometrica 82 (6), 2257–2294.

A Proofs Omitted from the Main Text

Let p, q ∈ (0, 1). We will abuse notation slightly by writing D(p||q) for the relative entropy between two Bernoulli distributions with success probabilities p and q. We first collect a pair of lemmata.

Lemma 1. For any p, q satisfying .5 ≤ p < q ≤ 1, we have D(p||q) > D(q||p).

Proof. We need to show that
\[
p \log\left(\frac{p}{q}\right) + (1-p)\log\left(\frac{1-p}{1-q}\right) > q\log\left(\frac{q}{p}\right) + (1-q)\log\left(\frac{1-q}{1-p}\right),
\]
or equivalently, that
\[
(2 - p - q)\log\left(\frac{1-p}{1-q}\right) - (p + q)\log\left(\frac{q}{p}\right) > 0. \tag{5}
\]
The left-hand side of (5) vanishes when p = q, so it suffices to show that the derivative of that left-hand side with respect to q is strictly positive. This reduces to verifying
\[
\frac{q-p}{q(1-q)} - \log\left(\frac{q}{p}\right) - \log\left(\frac{1-p}{1-q}\right) > 0. \tag{6}
\]


Now, we observe that the left-hand side of (6) vanishes when p = q and its derivative with respect to p is 1/(p(1−p)) − 1/(q(1−q)) < 0. Thus, we have (6) and, consequently, (5); the lemma follows.
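As a quick numerical sanity check (not part of the original argument), the following Python snippet evaluates the Bernoulli relative entropy on a small grid and confirms the inequality of Lemma 1; the helper name bernoulli_kl and the grid points are our own choices.

```python
import math

def bernoulli_kl(a, b):
    """Relative entropy D(a||b) between Bernoulli(a) and Bernoulli(b)."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

# Spot-check Lemma 1: D(p||q) > D(q||p) whenever .5 <= p < q < 1.
for p in [0.5, 0.6, 0.75, 0.9]:
    for q in [0.55, 0.7, 0.85, 0.99]:
        if p < q:
            assert bernoulli_kl(p, q) > bernoulli_kl(q, p), (p, q)
print("Lemma 1 holds on the sampled grid.")
```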

Lemma 2. For any ∆ > 0 and any p, q satisfying .5 ≤ p < q ≤ 1 − ∆, we have D(p || p + ∆) < D(q || q + ∆).

Proof. We fix ∆ and show that the expression
\[
D(p \,||\, p+\Delta) = p\log\left(\frac{p}{p+\Delta}\right) + (1-p)\log\left(\frac{1-p}{1-p-\Delta}\right) \tag{7}
\]
strictly increases in p. For the derivative of (7) to be positive, we need to check that
\[
\frac{\Delta}{(p+\Delta)(1-p-\Delta)} > \log\left(\frac{(p+\Delta)(1-p)}{p(1-p-\Delta)}\right). \tag{8}
\]
At p = 1/2, (8) reduces to the easy-to-verify inequality log((1+x)/(1−x)) < 2x/(1−x²), with x = 2∆. Taking the derivative of (8) with respect to p again, it remains to show that
\[
\frac{\Delta(2p + 2\Delta - 1)}{(p+\Delta)^2(1-p-\Delta)^2} > \frac{\Delta(2p + \Delta - 1)}{p(1-p)(p+\Delta)(1-p-\Delta)};
\]
this is true whenever p ≥ .5, and the lemma follows.
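Again purely as an illustration (with an example gap ∆ = 0.05 of our choosing), the monotonicity claim of Lemma 2 can be checked numerically:

```python
import math

def bernoulli_kl(a, b):
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

# Lemma 2: for fixed Delta, p -> D(p || p + Delta) increases on [.5, 1 - Delta].
delta = 0.05
grid = [0.5 + 0.01 * k for k in range(45)]          # p from 0.50 to 0.94
values = [bernoulli_kl(p, p + delta) for p in grid]
assert all(a < b for a, b in zip(values, values[1:]))
print("D(p || p + 0.05) is increasing on the sampled grid.")
```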

Proof of Proposition 2

We let p^s_0 = .99p + .8(1 − p) be Stephen's prior probability on the safe outcome and let
\[
p^s = \frac{.99pr + .8(1-p)(1-r)}{pr + (1-p)(1-r)}
\]
be the corresponding posterior probability following a signal with accuracy r. Similarly, we let p^d_0 = .8p + .99(1 − p) and
\[
p^d = \frac{.8pr + .99(1-p)(1-r)}{pr + (1-p)(1-r)}
\]
be Dirk's prior and posterior estimates respectively.

For any r < .5 we have .5 < p^s < p^s_0, .5 < p^d_0 < p^d, and, importantly, ∆ ≡ p^s_0 − p^s = p^d − p^d_0. The condition that r is not too small further ensures that p^d ≤ p^s_0, and so p^d_0 ≤ p^s. So Lemma 1 and Lemma 2 together imply that
\[
D(p^d \,||\, p^d_0) < D(p^d_0 \,||\, p^d) = D(p^d_0 \,||\, p^d_0 + \Delta) \le D(p^s \,||\, p^s + \Delta) = D(p^s \,||\, p^s_0). \tag{9}
\]
This expression (9) says exactly that the decision value for Stephen is higher than that for Dirk.
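The following snippet (ours, with illustrative values of p and r chosen so that the conditions used in the proof hold) verifies the common gap ∆ and the ranking in (9):

```python
import math

def bernoulli_kl(a, b):
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

for p in [0.85, 0.9, 0.95]:          # example prior weights p
    for r in [0.3, 0.4, 0.45]:       # accuracies below .5 but not too small
        denom = p * r + (1 - p) * (1 - r)
        ps0 = 0.99 * p + 0.8 * (1 - p)
        ps = (0.99 * p * r + 0.8 * (1 - p) * (1 - r)) / denom
        pd0 = 0.8 * p + 0.99 * (1 - p)
        pd = (0.8 * p * r + 0.99 * (1 - p) * (1 - r)) / denom
        assert abs((ps0 - ps) - (pd - pd0)) < 1e-9    # common gap Delta
        assert pd <= ps0 and pd0 <= ps                # conditions used in the proof
        assert bernoulli_kl(ps, ps0) > bernoulli_kl(pd, pd0)   # the chain (9)
print("Stephen's decision value exceeds Dirk's at every sampled (p, r).")
```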

Proof of Proposition 3

The log-likelihood ratio between the true model θ∗ and any alternative θ, conditional on a sequence of observed signals s_1, s_2, . . . , s_t, is given by
\[
l_t(s_1, s_2, \ldots, s_t) := \sum_{m=1}^{t} \log\bigl(\lambda(\theta^*)[s_m]\bigr) - \log\bigl(\lambda(\theta)[s_m]\bigr).
\]


Because the signals are i.i.d., generated from the distribution λ(θ∗), it follows from the Strong Law of Large Numbers that
\[
\frac{1}{t}\, l_t(s_1, s_2, \ldots, s_t) \to E_{\lambda(\theta^*)}\!\left[\log\left(\frac{\lambda(\theta^*)[s]}{\lambda(\theta)[s]}\right)\right] = D(\lambda(\theta^*) \,||\, \lambda(\theta)) \quad \text{(almost surely).} \tag{10}
\]
The distributions λ(θ∗) and λ(θ) are different, so their KL-divergence D(λ(θ∗) || λ(θ)) is strictly positive. (10) implies that the log-likelihood ratio l_t converges almost surely to ∞ for every θ ≠ θ∗. Thus, given sufficient experience, the true model θ∗ (which has non-zero prior probability) will eventually receive near-1 posterior probability.
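A minimal Monte Carlo sketch of the law-of-large-numbers step (10); the two signal distributions below are illustrative examples of our own, not taken from the paper.

```python
import math
import random

random.seed(0)
true_dist = {"s1": 0.6, "s2": 0.4}      # lambda(theta*), example values
alt_dist = {"s1": 0.3, "s2": 0.7}       # lambda(theta) for some theta != theta*

kl = sum(w * math.log(w / alt_dist[s]) for s, w in true_dist.items())

t, log_lr = 200_000, 0.0
signals, weights = list(true_dist), list(true_dist.values())
for _ in range(t):
    s = random.choices(signals, weights=weights)[0]   # i.i.d. draw from lambda(theta*)
    log_lr += math.log(true_dist[s]) - math.log(alt_dist[s])

print(f"average log-likelihood ratio: {log_lr / t:.4f}; KL divergence: {kl:.4f}")
```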

Proof of Proposition 4

As the information structure has full support, there exists γ > 0 such that λ(θ)[s] ≥ γ for every θ and s. Let M = 1/γ. We observe a simple lemma.

Lemma 3. Whenever an agent's prior belief satisfies p_0(θ) > 1 − ϵ for some θ and ϵ > 0, his posterior belief must be such that p_1(θ) > 1 − Mϵ.

Proof. For the proof, it suffices to note that the posterior belief of any model θ′ ≠ θ can increase by a factor of at most M relative to the prior: the likelihood of the realized signal is at most 1, while the denominator in Bayes' rule is at least γ.
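The bound can be illustrated numerically; the three models, two signals, and γ = 0.1 below are example choices of ours.

```python
import random

# One Bayesian update can scale any model's probability up by at most
# M = 1/gamma, so a prior above 1 - eps on some model leaves a posterior
# above 1 - M*eps on that model (Lemma 3).
random.seed(1)
models = ["theta*", "theta1", "theta2"]
gamma = 0.1
M = 1 / gamma

for _ in range(1000):
    lik = {}
    for th in models:
        x = random.uniform(gamma, 1 - gamma)   # P(signal 0 | th); both signals >= gamma
        lik[th] = [x, 1 - x]
    eps = random.uniform(0.001, 0.05)
    prior = {"theta*": 1 - eps, "theta1": eps / 2, "theta2": eps / 2}
    for s in (0, 1):
        denom = sum(prior[th] * lik[th][s] for th in models)
        posterior = {th: prior[th] * lik[th][s] / denom for th in models}
        assert all(posterior[th] <= M * prior[th] + 1e-12 for th in models)
        assert posterior["theta*"] > 1 - M * eps
print("Lemma 3's bound holds in all sampled examples.")
```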

Now, we use the preceding lemma to provide a lower bound on the utility the agent expects if he commits not to update. We assume without loss of generality that all outcomes are possible given the prior p_0, and denote N ≡ |Ω| < ∞.

In each period t, a non-updating agent forgets the signals he has seen and starts with the same prior p_0. Because he rejects all signals, the amount he allocates to outcome ω ∈ Ω is always p^Ω_0(ω). His expected payoff in period t is thus
\[
\pi_t \equiv \sum_{\omega \in \Omega} p^\Omega_1(\omega) \cdot \log\bigl(p^\Omega_0(\omega)\bigr), \tag{11}
\]
where p^Ω_1 is the posterior distribution on outcomes conditional on the signal realized in period t. Lemma 3 implies that for any θ, p_1(θ) and p_0(θ) differ by at most Mϵ and by at most a multiple of M. It follows that for any ω, p^Ω_1(ω) and p^Ω_0(ω) satisfy the same properties. So we have from (11) that
\[
\pi_t \ge \sum_{\omega \in \Omega} p^\Omega_0(\omega) \cdot \log\bigl(p^\Omega_0(\omega)\bigr) - \sum_{\omega \in \Omega} \bigl|p^\Omega_1(\omega) - p^\Omega_0(\omega)\bigr| \cdot \bigl|\log\bigl(p^\Omega_0(\omega)\bigr)\bigr| \ge -H(p^\Omega_0) - NM|\epsilon \log \epsilon|,
\]
where H(·) is the entropy function (Cover and Thomas (2012)).¹⁴

Our non-updating agent thus ensures a total payoff of
\[
\sum_{t \ge 1} \delta^{t-1} \pi_t \ge \frac{1}{1-\delta}\Bigl(-H(p^\Omega_0) - NM|\epsilon \log \epsilon|\Bigr). \tag{12}
\]

¹⁴ The latter step follows from considering the cases p^Ω_0(ω) ≥ ϵ and p^Ω_0(ω) < ϵ separately.


We can also give an upper bound on the agent's total payoff assuming that updating is costless and that the agent can freely choose to update. For this let q_t ∈ ∆(Θ) be the posterior that the agent knows about θ after seeing the signal in period t, and let p_t be the posterior that he retains. Regardless of the retention decision, q_t(θ) ∝ p_{t−1}(θ) · λ(θ)[s_t] (where s_t denotes the signal realized in period t). However p_t = q_t only when the agent accepts the signal; p_t = p_{t−1} otherwise.

The agent allocates according to p_t at the end of period t, for an expected payoff of
\[
\tilde{\pi}_t \equiv \sum_{\omega \in \Omega} q^\Omega_t(\omega) \cdot \log\bigl(p^\Omega_t(\omega)\bigr) \le -H(q^\Omega_t), \tag{13}
\]
where the inequality is due to KL-divergence being non-negative. From Lemma 3 and induction we have p_t(θ), q_t(θ) ≥ 1 − M^t ϵ for each t. Thus the total variation distance between q_t and p_0 is at most M^t ϵ, which also bounds ||q^Ω_t − p^Ω_0||.

Now take T sufficiently large such that (δ^T/(1−δ)) H(p^Ω_0) < c/3. For this T, we take ϵ sufficiently small such that ((1−δ^T)/(1−δ)) |H(p^Ω_0) − H(q^Ω)| < c/3 whenever ||p^Ω_0 − q^Ω|| ≤ M^T ϵ. This can be done independently of p^Ω_0 because the entropy function is continuous in the total variation norm, hence uniformly continuous (Rudin (1964, Thm. 4.19)). Making ϵ even smaller if necessary, we may assume that (1/(1−δ)) NM|ϵ log ϵ| < c/3.

From (13), the non-negativity of H(·), and the preceding assumptions, we deduce that the total payoff for a freely updating agent is bounded above by
\[
\begin{aligned}
\sum_{t \ge 1} \delta^{t-1} \tilde{\pi}_t &\le \sum_{t \ge 1} \delta^{t-1}\bigl(-H(q^\Omega_t)\bigr) \\
&\le \sum_{1 \le t \le T} \delta^{t-1}\bigl(-H(q^\Omega_t)\bigr) \\
&< \sum_{1 \le t \le T} \delta^{t-1}\bigl(-H(p^\Omega_0)\bigr) + \frac{c}{3} \\
&< \frac{1}{1-\delta}\bigl(-H(p^\Omega_0)\bigr) + \frac{2c}{3} \\
&\le \sum_{t \ge 1} \delta^{t-1}\pi_t + c,
\end{aligned}
\tag{14}
\]
where the last step follows from (12) (recall that π_t is the per-period payoff of the non-updating agent). Now, (14) shows that even when updating is costless, the agent cannot improve her total payoff by c relative to a counterfactual in which she commits not to update. As a corollary, we see that when the updating cost is c, it is never worthwhile to pay the updating cost of c in any period (given any signal); this completes the proof of the proposition.
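A Monte Carlo sketch of this comparison (entirely ours; the two models, signal probabilities, δ, c, and ϵ below are illustrative choices) estimates the expected discounted gain from costless updating when the prior is concentrated, and finds it well below c, consistent with (14).

```python
import math
import random

random.seed(2)
eps, delta, c = 1e-4, 0.9, 0.05
models = ("A", "B")
prior = {"A": 1 - eps, "B": eps}
lik = {"A": (0.7, 0.3), "B": (0.3, 0.7)}    # signal probabilities, full support
T, n_paths = 150, 400                        # horizon where delta**T is negligible

def discounted_payoff(theta, update):
    """One simulated path; the outcome each period equals the true model theta."""
    total, belief = 0.0, dict(prior)
    for t in range(T):
        s = 0 if random.random() < lik[theta][0] else 1
        if update:                            # costless Bayesian updating
            denom = sum(belief[m] * lik[m][s] for m in models)
            belief = {m: belief[m] * lik[m][s] / denom for m in models}
        total += delta ** t * math.log(belief[theta])   # log score on realized outcome
    return total                              # non-updater keeps belief == prior

gain = 0.0
for theta in models:                          # condition on theta to reduce variance
    upd = sum(discounted_payoff(theta, True) for _ in range(n_paths)) / n_paths
    non = sum(discounted_payoff(theta, False) for _ in range(n_paths)) / n_paths
    gain += prior[theta] * (upd - non)
print(f"expected gain from costless updating ≈ {gain:.4f} (updating cost c = {c})")
```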


Proof of Proposition 5

We will first consider a myopic agent facing the following information structure:

           θ∗          θ′
  s∗    1/(r+1)     r/(r+1)
  s′    r/(r+1)     1/(r+1)

with some small positive r. Assume throughout the proof that outcomes are simply generated by the point-mass distribution at the model.

For p ∈ (0, 1), define a pair of functions:
\[
F_r(p) = \frac{p}{p + (1-p)\cdot r}; \qquad G_r(p) = \frac{p \cdot r}{p \cdot r + (1-p)}.
\]
If an agent assigns prior probability p to model θ∗, his posterior becomes F_r(p) given the signal s∗ and G_r(p) given the signal s′. Notice that F_r and G_r are inverse functions, corresponding to the fact that after an agent sees s∗ followed by s′, his posterior equals his prior. It follows that starting with any initial belief p_0, the agent's posterior beliefs all belong to the countable set {F_r^{(k)}(p_0)}_{k=−∞}^{∞}.
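A small illustration (ours; the values r = 0.2 and p_0 = 0.55 are arbitrary examples) of the two updating maps and the countable belief grid they generate:

```python
# F_r and G_r are mutually inverse, so from any p0 the reachable beliefs
# form the countable grid F_r^(k)(p0).
def F(p, r):
    return p / (p + (1 - p) * r)        # posterior after accepting s*

def G(p, r):
    return p * r / (p * r + (1 - p))    # posterior after accepting s'

r, p0 = 0.2, 0.55
assert abs(G(F(p0, r), r) - p0) < 1e-12
assert abs(F(G(p0, r), r) - p0) < 1e-12

grid = [p0]
for _ in range(3):                      # beliefs reachable within three updates
    grid = [G(grid[0], r)] + grid + [F(grid[-1], r)]
print([round(p, 4) for p in grid])
```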

The decision value for the agent upon seeing s∗ is
\[
\begin{aligned}
V_{F,r}(p) := D(F_r(p) \,||\, p) &= \frac{p}{p + (1-p)r}\cdot \log\left(\frac{1}{p + (1-p)r}\right) + \frac{(1-p)r}{p + (1-p)r}\cdot \log\left(\frac{r}{p + (1-p)r}\right) \\
&= -\log\bigl(p + (1-p)r\bigr) + \frac{(1-p)r}{p + (1-p)r}\cdot \log r.
\end{aligned}
\tag{15}
\]
Similarly the decision value of s′ is
\[
V_{G,r}(p) := -\log\bigl(pr + 1 - p\bigr) + \frac{pr}{pr + 1 - p}\cdot \log r. \tag{16}
\]

The following lemma establishes the cutoff structure for accepting signals:

Lemma 4. For all r ∈ (0, 1), when p ≥ 1/2, V_{F,r}(p) is strictly decreasing in p while V_{G,r}(p) is first increasing then decreasing. Moreover, V_{G,r}(p) > V_{F,r}(p) for every p ∈ (1/2, 1). In words, the decision value of a "reinforcement" signal decreases in the prior probability, and it is always smaller than the decision value of a "counteracting" signal.

Proof. We omit the subscript reference to r and calculate that
\[
V_F'(p) = -\frac{1-r}{p + (1-p)r} - \frac{r \log r}{(p + (1-p)r)^2}; \qquad
V_G'(p) = \frac{1-r}{pr + 1 - p} + \frac{r \log r}{(pr + 1 - p)^2}.
\]


For V_F'(p) < 0 we need (1 − r)(p + (1 − p)r) > −r log r. As p ≥ 1/2, we only need to check (1 − r²)/2 > −r log r. Differentiating twice, we see this inequality indeed holds for all r < 1. For a similar reason, we see that V_G'(p) is first positive and then negative.

To show that V_G(p) > V_F(p) for p ∈ (1/2, 1), we notice first that the two functions are equal at both endpoints of the interval. It then suffices to show that V_G'(p) − V_F'(p) is first positive then negative, crossing 0 exactly once. We calculate that
\[
V_G'(p) - V_F'(p) = (1-r)\left(\frac{1}{pr + 1 - p} + \frac{1}{p + (1-p)r}\right) - (-r\log r)\left(\frac{1}{(pr + 1 - p)^2} + \frac{1}{(p + (1-p)r)^2}\right).
\]
Now, the ratio between 1/(pr + 1 − p)² + 1/(p + (1 − p)r)² and 1/(pr + 1 − p) + 1/(p + (1 − p)r) equals
\[
\frac{1}{1+r}\cdot\left(\frac{pr + 1 - p}{p + (1-p)r} + \frac{p + (1-p)r}{pr + 1 - p}\right),
\]
which is increasing in p. It follows that V_G(p) > V_F(p), and we have the lemma.
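A numerical spot-check of Lemma 4 via the closed forms (15)–(16) (the value r = 0.2 and the grid are our own illustrative choices):

```python
import math

def V_F(p, r):
    return -math.log(p + (1 - p) * r) + (1 - p) * r / (p + (1 - p) * r) * math.log(r)

def V_G(p, r):
    return -math.log(p * r + 1 - p) + p * r / (p * r + 1 - p) * math.log(r)

r = 0.2
ps = [0.5 + 0.005 * k for k in range(100)]           # p from 0.500 to 0.995
vf = [V_F(p, r) for p in ps]
assert all(a > b for a, b in zip(vf, vf[1:]))        # V_F strictly decreasing
assert all(V_G(p, r) > V_F(p, r) for p in ps if p > 0.5)
print("Lemma 4's monotonicity and ranking hold on the sampled grid.")
```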

We now define a pair of cutoffs
\[
p_{F,r}(c) = \max\{p : V_{F,r}(p) \ge c\}; \qquad p_{G,r}(c) = \max\{p : V_{G,r}(p) \ge c\}. \tag{17}
\]
For every cost c ∈ (0, log 2) and sufficiently small r, we have V_F(1/2) = V_G(1/2) > c. Lemma 4 then implies that 1/2 < p_{F,r}(c) < p_{G,r}(c). We will frequently omit the reference to c to ease notation.

Using Lemma 4 and the symmetry of the information structure, we deduce that the interval [0, 1] is divided into five subintervals: [0, 1 − p_{G,r}], [1 − p_{G,r}, 1 − p_{F,r}], [1 − p_{F,r}, p_{F,r}], [p_{F,r}, p_{G,r}], [p_{G,r}, 1]. If the agent's prior p is in the first or last subinterval, he is in a belief trap and rejects both signals. If p is in the central subinterval, both signals are accepted. If p_{F,r} < p < p_{G,r}, the agent accepts the signal s′ and rejects s∗. Similarly he accepts s∗ and rejects s′ if 1 − p_{G,r} < p < 1 − p_{F,r}.

To complete the argument for cycling, we have to show that starting with any initial prior p_0 ∈ (1 − p_{G,r}, p_{G,r}), the agent's posterior never jumps out of this interval to one of the two belief traps. For this it suffices that the agent's belief never moves from the central subinterval to the rightmost (or likewise leftmost) subinterval. Using the notation we have introduced, we need to check F_r(p_{F,r}) < p_{G,r}. But this follows directly from Lemma 1: we have c = D(F_r(p_{F,r}) || p_{F,r}) < D(p_{F,r} || F_r(p_{F,r})) = D(G_r(F_r(p_{F,r})) || F_r(p_{F,r})); that is, V_{G,r}(F_r(p_{F,r})) > c. By definition (17), we have p_{G,r} > F_r(p_{F,r}). This proves the existence of belief cycling for myopic agents.
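The construction can be simulated directly; in the sketch below (ours), the parameters r = 0.1 and c = 0.25 < log 2 are example choices, and the agent accepts a signal exactly when its decision value, (15) or (16), is at least c. Over the simulated horizon the belief keeps bouncing among a small set of values and never settles.

```python
import math
import random

random.seed(3)

def V_F(p, r):
    return -math.log(p + (1 - p) * r) + (1 - p) * r / (p + (1 - p) * r) * math.log(r)

def V_G(p, r):
    return -math.log(p * r + 1 - p) + p * r / (p * r + 1 - p) * math.log(r)

r, c = 0.1, 0.25
p = 0.5                                    # current belief in theta*, the true model
visited = set()
for _ in range(20000):
    s = "s*" if random.random() < 1 / (1 + r) else "s'"   # signal drawn under theta*
    if s == "s*" and V_F(p, r) >= c:
        p = p / (p + (1 - p) * r)          # accept s*: belief moves up
    elif s == "s'" and V_G(p, r) >= c:
        p = p * r / (p * r + 1 - p)        # accept s': belief moves down
    visited.add(round(p, 6))
print("beliefs visited:", sorted(visited))  # a small cycle; the belief never converges
```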


We can extend the argument to cover patient agents. Hold c and r fixed and consider the following modified information structure parametrized by ϵ:

           θ∗               θ′
  s∗    (1/(r+1)) · ϵ     (r/(r+1)) · ϵ
  s′    (r/(r+1)) · ϵ     (1/(r+1)) · ϵ
  s0    1 − ϵ             1 − ϵ

Notice that s0 is totally uninformative, while conditional on s ≠ s0 the information structure is the same as before.

Because s0 happens with overwhelming probability, the effective discount factor is close to 0. So under this information structure the patient agent acts as if he is myopic. The argument above can thus be applied with little change.¹⁵

¹⁵ Even though the region of accepting any signal may no longer be an interval for any positive ϵ, we can leverage the fact that posteriors take only finitely many values. Thus, the places where the interval property breaks down have vanishingly small length as ϵ → 0, and we can find an open interval of priors from which these problematic beliefs will never be reached. The argument then goes through.

Proof of Proposition 6

We first consider a myopic agent whose information structure is given by

           θ∗        θ′
  s∗    1 − ϵ      1 − λϵ
  s′    ϵ          λϵ

with some λ > 1 and ϵ > 0 sufficiently small. As before, assume that the outcome is equal to the model (that is, if the model is a, the outcome 1 has probability 1, and vice versa for b).

Define r∗ = 1/λ and r′ = (1 − λϵ)/(1 − ϵ). Then using notation from the proof of Proposition 5, we know that the signal s∗ takes the agent's belief from p to F_{r′}(p) when he updates, while s′ renders his belief G_{r∗}(p). Note that r∗ < 1 is independent of ϵ, while r′ → 1⁻ as ϵ → 0. Let us show that for r′ sufficiently close to 1,
\[
V_{G,r^*}(p) > V_{F,r'}(p), \quad \forall p \in (0, 1). \tag{18}
\]

Intuitively, because the signal s∗ is only very weakly informative about θ∗, its decision value is small uniformly over p. The remaining technical difficulty is to establish the above inequality for p close to 0 or 1. For this we calculate the derivatives of V_F and V_G near these endpoints. From Lemma 4, V_{G,r∗}'(0) = 1 − r∗ + r∗ log r∗, which is a fixed positive number. On the other hand V_{F,r′}'(0) = −(1 − r′ + log r′)/r′ → 0 as r′ → 1. Thus V_{F,r′}'(0) < V_{G,r∗}'(0) and similarly V_{F,r′}'(1) > V_{G,r∗}'(1) for ϵ sufficiently small, establishing (18).

Moreover, from the expression for V_{F,r′}'(p) in Lemma 4 we see that this derivative becomes uniformly (over p) close to zero as r′ → 1. Since F_{r′}(p)/p becomes uniformly close to 1, we can use a similar argument to deduce
\[
V_{G,r^*}(F_{r'}(p)) > V_{F,r'}(p), \quad \forall p \in (0, 1). \tag{19}
\]



Now take any c < min{V_{G,r∗}(p_0), V_{G,r∗}(η)} (recall r∗ is fixed). As in the proof of Proposition 5, the interval [0, 1] is divided into five subintervals L, C_L, C, C_R, R. In the leftmost subinterval L and the rightmost subinterval R, the agent rejects both signals. He accepts both signals in the central subinterval C, which may be empty. Finally he accepts s′ but rejects s∗ in both C_L and C_R, due to (18). The choice of c ensures that the initial belief p_0 does not belong to R, and (19) implies that the belief will never enter R (as p cannot jump from C to R). It follows that the agent's belief will almost surely enter L, after a finite number of consecutive s′ signals. By the choice of c, the final belief is lower than η as desired.
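A deterministic sketch of this argument (ours; the parameters λ = 2, ϵ = 0.001, p_0 = 0.9, η = 0.1, and c = 0.01 are example choices satisfying c < min{V_{G,r∗}(p_0), V_{G,r∗}(η)}). Because s∗ is never worth processing at this cost, only the accepted s′ signals move the belief, so we can simply iterate that map until the belief is trapped below η.

```python
import math

lam, eps, p0, eta, c = 2.0, 0.001, 0.9, 0.1, 0.01
r_star, r_prime = 1 / lam, (1 - lam * eps) / (1 - eps)

def V_F(p, r):   # decision value of s*, eq. (15)
    return -math.log(p + (1 - p) * r) + (1 - p) * r / (p + (1 - p) * r) * math.log(r)

def V_G(p, r):   # decision value of s', eq. (16)
    return -math.log(p * r + 1 - p) + p * r / (p * r + 1 - p) * math.log(r)

p, accepted = p0, 0
while V_G(p, r_star) >= c:                  # s' arrives infinitely often, a.s.
    assert V_F(p, r_prime) < c              # s* is rejected along the entire path
    p = p * r_star / (p * r_star + 1 - p)   # accept s': belief in theta* drops
    accepted += 1
assert V_F(p, r_prime) < c                  # ...and also once the belief is trapped
print(f"after {accepted} accepted s' signals the belief settles at {p:.4f} < eta")
```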

Finally we can extend this argument to a patient agent by making ϵ even smaller, just as we did with the other proofs. A key observation in this setting is that a smaller ϵ not only makes signal s∗ less informative, but it also makes any s′ signal more valuable because the agent won't get another one for a long time. We omit the technical details.
