
Reinforcement Learning China Summer School

RLChina 2021

Bayesian Brain

Prof. Jun Wang

UCL

August 17, 2021

1/49

2/49

Life and the universe


3/49

Life and the universe

4/49

Table of Contents

1 Learning in biological and computerised systems

2 Perception-control loop: the problem

3 Active inference: perception

4 Active inference: perception-control loop

5 An illustrative example

6 Control as inference

7 Multi-agent Variational Bayes

8 Conclusions

5/49

1 Learning in biological and computerised systems

6/49

What is “learning”?

in biology, learning means a change of behaviour as a result of experience

in classical conditioning1, animals can learn to identify a useful pattern in the environment by associating one stimulus with another: repeatedly given {ring-a-bell, food}, a dog will start to salivate (anticipating the arrival of food) when the bell rings again

learned behaviours are adaptive, and thus are essential for animals to survive in a changing environment

e.g., they may learn not to eat certain foods if they have ever become ill after eating them

more learned behaviours =⇒ more intelligent

1Ivan Petrovitch Pavlov and William Gantt. “Lectures on conditioned reflexes: Twenty-five years of objective study of the higher nervous activity (behaviour) of animals.” In: (1928).

7/49

Learning is not limited to biological systems

machine learning answers the question of how computer programmes can improve their performance through experience

past experience exists in various forms, typically including (1) human-labelled or unlabelled data sets, (2) past interactions with the environment, or (3) data collected from realistic simulators

with experience, computers can be trained to identify associations, spot underlying patterns, and make accurate predictions and forecasts about the future

as a field, machine learning studies the fundamental theory and algorithms that provide principled computational methods for learning

8/49

Three levels in any information processing system2

computational theory

what is the problem, in a generic manner? what is the computing goal? what is the logic of the strategy behind it?

representation and algorithm

how can the identified computational problems be solved? what are the inputs/outputs and the algorithm for the transformation?

hardware implementation

how can the representation and algorithm be realised physically?

human: biological; AI: silicon using transistors

2David Marr. “Vision: A computational investigation into the human representation and processing of visual information”. In: Inc., New York, NY 2.4.2 (1982).

9/49

Human intelligence vs. current AI3

3Eliezer Sternberg. NeuroLogic: The Brain’s Hidden Rationale Behind Our Irrational Behavior. Vintage, 2016.

10/49

Typical narrow machine learning

Figure: Data X → predict → Label Y

mathematically, ML can loosely boil down to the question of finding an unknown function mapping f : X → Y,

X is the input feature space representing data points, Y is the label space representing the knowledge outputs; thus the mapping f represents a knowledge-discovery process from a given data point x ∈ X to the specific label y ∈ Y associated with that data point: y = f(x)

however, as f is not known a priori, the goal of learning is to identify a hypothesis h ∈ H from a predefined set H to approximate the unknown f, so that

the learned function h can predict the output variable y_new = h(x_new) for a given new data point
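To make the function-mapping view concrete, here is a minimal NumPy sketch (not from the lecture) that fits a hypothesis h from a simple hypothesis class H of affine functions to noisy samples of an assumed unknown f; the particular f, the noise level, and the least-squares fit are illustrative choices only.

```python
import numpy as np

# Hypothetical ground-truth mapping f: X -> Y (unknown to the learner in practice).
def f(x):
    return 3.0 * x + 1.0

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=50)          # data points x in X
Y = f(X) + rng.normal(scale=0.1, size=50)    # noisy labels y in Y

# Hypothesis class H: affine functions h(x) = w*x + b, fitted by least squares.
design = np.stack([X, np.ones_like(X)], axis=1)
w, b = np.linalg.lstsq(design, Y, rcond=None)[0]

def h(x_new):
    """Learned hypothesis approximating the unknown f."""
    return w * x_new + b

print(h(0.5), f(0.5))  # prediction for a new point vs. the (normally unknown) truth
```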

11/49

View narrow machine learning as function mapping

f : X → Y, organised by task family:

Supervised learning: Regression (X: independent variables, Y: dependent variables); Classification (X: example features, Y: class labels)

Unsupervised learning: Clustering (X: example features, Y: cluster index); Anomaly detection (X: example features, Y: whether it is an outlier or not)

Generative models: X: class labels, Y: sample features

Reinforcement learning: X: state features, Y: actions to take

machine learning tasks differ in the experience available, the objective function, and the specific learning algorithms

12/49

Machine learning: reasoning under uncertainty

pattern recognition:

Figure: Data X → predict → Label Y

decision making (reinforcement learning):

Figure: the decision-making loop. A single agent acts on the environment and receives observations (act/obs); in the multi-agent case, Agent 1 and Agent 2 both act on and observe the same environment.

13/49

2 Perception-control loop: the problem

14/49

A living system and its environment

Figure: a living system coupled to its environment. The generative process (the true but hidden states s*_{t−1}, s*_t, s*_{t+1}) produces observations o_{t−1}, o_t, o_{t+1}; the agent's generative model maintains internal states (sensations) s_{t−1}, s_t, s_{t+1} and emits actions u_{t−1}, u_t that feed back into the environment.

15/49

Entropy

Entropy: a measure of the average surprise (or uncertainty, or disorder) of a random event sampled from a probability distribution or density

definition: the entropy of a random event X with possible outcomes x_1, ..., x_n:

H(p) = −∑_i p(X = x_i) log p(X = x_i)

low entropy means that, on average, the outcome is relatively predictable
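As a quick numerical illustration (not part of the slides), the sketch below evaluates this definition for two hypothetical distributions; natural logarithms are used, so the result is in nats.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy H(p) = -sum_i p_i log p_i, in nats."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

print(entropy([0.5, 0.5]))    # ~0.693: a fair coin, maximally uncertain
print(entropy([0.99, 0.01]))  # ~0.056: low entropy, the outcome is relatively predictable
```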

16/49

3 Active inference: perception

17/49

Variational Bayes4/Free energy principle5

an agent builds a world model by forming an internal representation s of the world from observations o as:

p(s|o) = p(o|s) p(s) / ∫_s p(o|s) p(s) ds

typically the denominator is intractable, and one can approximate the posterior with a tractable function family q(s) ∈ Q by minimising a dissimilarity measure, e.g., the KL divergence:

KL(q(s) || p(s|o)) = E_{q(s)}[log (q(s) / p(s|o))]
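For a discrete toy model the posterior is tractable, so the variational machinery can be checked directly. The sketch below (with made-up numbers, not from the lecture) computes the exact posterior p(s|o) for a two-state world and evaluates the KL divergence for a candidate q(s); minimising that KL over q recovers the exact posterior.

```python
import numpy as np

# Toy discrete world: hidden state s in {0, 1}, observation o in {0, 1}.
prior = np.array([0.7, 0.3])            # p(s)
likelihood = np.array([[0.9, 0.1],      # p(o | s): rows indexed by s, columns by o
                       [0.2, 0.8]])

o = 1                                    # the observation actually received
joint = likelihood[:, o] * prior         # p(o, s) for each s
posterior = joint / joint.sum()          # exact p(s | o); tractable in this tiny model

def kl(q, p, eps=1e-12):
    """KL(q || p) for discrete distributions."""
    return float(np.sum(q * (np.log(q + eps) - np.log(p + eps))))

q = np.array([0.5, 0.5])                 # a candidate variational approximation q(s)
print(posterior, kl(q, posterior))       # the KL shrinks to 0 as q approaches p(s|o)
```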

4Michael I Jordan et al. “An introduction to variational methods for graphical models”. In: Machine Learning 37.2 (1999), pp. 183–233.

5Karl Friston. “The free-energy principle: a unified brain theory?” In: Nature Reviews Neuroscience 11.2 (2010), pp. 127–138.

18/49

Variational Bayes6/Variational free energy7

we then derive:

KL(q(s) || p(s|o)) = E_{q(s)}[log (q(s) / p(s|o))]
= E_{q(s)}[log q(s) − log p(s, o) + log p(o)]
= E_{q(s)}[log q(s) − log p(s, o)] + log p(o)

reorganising gives the measure of surprise (the negative model evidence, or negative log marginal likelihood):

−log p(o) = E_{q(s)}[log q(s) − log p(s, o)] − D_KL[q(s) || p(s|o)]
≤ E_{q(s)}[log q(s) − log p(s, o)] ≡ F(q)

where the RHS is the variational free energy (the negative ELBO, Evidence Lower BOund)

6Michael I Jordan et al. “An introduction to variational methods for graphical models”. In: Machine Learning 37.2 (1999), pp. 183–233.

7Karl Friston. “The free-energy principle: a unified brain theory?” In: Nature Reviews Neuroscience 11.2 (2010), pp. 127–138.

19/49

Free energy principle9

the VFE can be decomposed in three principal ways:

F(q) = E_{q(s)}[log q(s) − log p(s, o)]
     = E_{q(s)}[log q(s)] (Negative Entropy) − E_{q(s)}[log p(s, o)] (Energy)
     = D_KL[q(s) || p(s)] (Complexity) − E_{q(s)}[log p(o|s)] (Accuracy)
     = D_KL[q(s) || p(s|o)] (Posterior Divergence) − log p(o) (Negative Log Model Evidence)

free energy principle: the goal of a living system is to minimise the free energy F in order to avoid surprising observations (states) → maintain homeostasis8 (and thus remain alive)
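The three decompositions can be verified numerically on the same kind of two-state toy model (all numbers below are invented for illustration): however q(s) is chosen, the three expressions coincide and upper-bound the surprise −log p(o).

```python
import numpy as np

eps = 1e-12
prior = np.array([0.7, 0.3])            # p(s)
likelihood = np.array([[0.9, 0.1],      # p(o | s): rows s, columns o
                       [0.2, 0.8]])
o = 1
q = np.array([0.4, 0.6])                # an arbitrary variational q(s)

joint = likelihood[:, o] * prior        # p(s, o) for the observed o
evidence = joint.sum()                  # p(o)
posterior = joint / evidence            # p(s | o)

kl = lambda a, b: float(np.sum(a * (np.log(a + eps) - np.log(b + eps))))

F_energy_entropy = float(np.sum(q * (np.log(q + eps) - np.log(joint + eps))))
F_complexity_accuracy = kl(q, prior) - float(np.sum(q * np.log(likelihood[:, o] + eps)))
F_divergence_evidence = kl(q, posterior) - np.log(evidence)

print(F_energy_entropy, F_complexity_accuracy, F_divergence_evidence)  # all three agree
print(F_energy_entropy >= -np.log(evidence))                           # F bounds the surprise
```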

8The process whereby an open or closed system regulates its internal environment to maintain its states within bounds.

9Karl Friston. “The free-energy principle: a unified brain theory?” In: Nature Reviews Neuroscience 11.2 (2010), pp. 127–138.

20/49

4 Active inference: perception-control loop

21/49

Extension to handle dynamics and control

the real world is dynamic, so we extend the model to have an observation o = {o_1, ..., o_T} and a hidden state at each point in time, s = {s_1, ..., s_T}

we also add an action policy π = {u_1, ..., u_T} which interacts with the environment by altering the next hidden state

the generative model: 1) p(s_{t+1} | s_t, u_t) for t ≥ 1, with prior p(s_1); 2) p(o_t | s_t)

the Expected Free Energy (EFE), from time τ until the time horizon T:

G = E_{q(o_{τ:T}, s_{τ:T}, π)}[ln q(s_{τ:T}, π) − ln p(o_{τ:T}, s_{τ:T})]

where an agent's goals are encoded as a (subjective) desired distribution over observations p(o_{τ:T})10; thus we have p(o_τ, s_τ) ≈ p(o_τ) q(s_τ | o_τ)

10in ML, an optimality variable is added to encode the desires

22/49

Expected free energy12

a temporal mean-field factorisation11 is assumed:

the variational function: q(s_{τ:T}, π) ≈ q(π) ∏_{τ}^{T} q(s_τ | π)

the generative model: p(o_{τ:T}, s_{τ:T}) ≈ ∏_{τ}^{T} p(o_τ) q(s_τ | o_τ)

as a result, the terms are independent between time steps:

G = E_{q(o_{τ:T}, s_{τ:T}, π)}[ln q(s_{τ:T}, π) − ln p(o_{τ:T}, s_{τ:T})]
  = E_{q(o_{τ:T}, s_{τ:T} | π) q(π)}[ln q(s_{τ:T} | π) + ln q(π) − ln p(o_{τ:T}, s_{τ:T})]
  = E_{q(π)}[ln q(π) + E_{q(o_{τ:T}, s_{τ:T} | π)}[∑_{τ}^{T} (ln q(s_τ | π) − ln p(o_τ, s_τ))]]
  = E_{q(π)}[ln q(π) − (−∑_{τ}^{T} G_τ(π))]
  = D_KL[q(π) || exp(−∑_{τ}^{T} G_τ(π))]

where G_τ(π) = E_{q(o_τ, s_τ | π)}[ln q(s_τ | π) − ln p(o_τ, s_τ)]

11David M Blei, Alp Kucukelbir, and Jon D McAuliffe. “Variational inference: A review for statisticians”. In: Journal of the American Statistical Association 112.518 (2017), pp. 859–877.

12Beren Millidge, Alexander Tschantz, and Christopher L Buckley. “Whence the expected free energy?” In: Neural Computation 33.2 (2021), pp. 447–482.

23/49

Expected free energy

thus optimising G results in

q*(π) = SoftMax(−∑_{τ}^{T} G_τ(π))

the EFE at time τ, G_τ, can be further decomposed as:

G_τ(π) = E_{q(o_τ, s_τ | π)}[ln q(s_τ | π) − ln p(o_τ, s_τ)]
       ≈ E_{q(o_τ, s_τ | π)}[ln q(s_τ | π) − ln p(o_τ) − ln q(s_τ | o_τ)]
       ≈ −E_{q(o_τ | π)}[ln p(o_τ)] (Extrinsic Value) − E_{q(o_τ)}[D_KL[q(s_τ | o_τ) || q(s_τ | π)]] (Epistemic Value)

thus, optimal policies are obtained by minimising the sum of the expected free energies

the EFE is estimated by using the generative model to roll out predicted futures and computing the EFE of those futures

24/49

Expected free energy

similar to the variational free energy, the EFE can also be decomposed as:

G_τ(π) = E_{q(o_τ, s_τ | π)}[ln q(s_τ | π) − ln p(o_τ, s_τ)]
       ≈ E_{q(o_τ, s_τ | π)}[ln q(s_τ | π) − ln p(o_τ) − ln q(s_τ | o_τ)]
       ≈ −E_{q(o_τ, s_τ | π)}[ln p(o_τ)] (Extrinsic Value) − E_{q(o_τ)}[D_KL[q(s_τ | o_τ) || q(s_τ | π)]] (Epistemic Value)

note that epistemic uncertainty refers to deficiencies caused by a lack of knowledge or information; it is reducible with more data or a better model

the second term measures how much the entropy of s_τ is reduced once o_τ is observed → maximising its value → we prefer policies for which H[q(s_τ | π)] is high and there is a strong dependency between states and observations (i.e., H[q(s_τ | o_τ)] is low)

25/49

Expected free energy

one can also decompose it into the following:

G_τ(π) = E_{q(o_τ, s_τ | π)}[ln q(s_τ | π) − ln p(o_τ, s_τ)]
       = E_{q(o_τ, s_τ | π)}[ln q(s_τ | π) − ln p(o_τ | s_τ) − ln p(s_τ)]
       = −E_{q(o_τ, s_τ | π)}[ln p(o_τ | s_τ)] (Accuracy) + D_KL[q(s_τ | π) || p(s_τ)] (Complexity)

or this (typically used for computation):

G_τ(π) = E_{q(o_τ, s_τ | π)}[ln q(s_τ | π) − ln p(o_τ) − ln q(s_τ | o_τ)]
       = E_{q(o_τ, s_τ | π)}[ln q(s_τ | π) − ln p(o_τ) − ln p(o_τ | s_τ) − ln q(s_τ | π) + ln q(o_τ)]   (the ln q(s_τ | π) terms cancel)
       = D_KL[q(o_τ | π) || p(o_τ)] (Expected Cost) + E_{q(s_τ | π)}[H[p(o_τ | s_τ)]] (Expected Ambiguity)
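Because this last form only involves the predicted observation distribution and the likelihood, it is easy to evaluate numerically. The sketch below (hypothetical numbers, anticipating the discrete example in the next section) scores a predicted state belief q(s|π) by its expected cost plus expected ambiguity.

```python
import numpy as np

eps = 1e-12
A = np.array([[0.9, 0.1],               # hypothetical likelihood p(o | s): rows o, columns s
              [0.1, 0.9]])
preferred_o = np.array([0.9, 0.1])      # preferred observations p(o): we want o = 0

def efe(q_s):
    """G_tau for a predicted state belief q(s | pi): expected cost + expected ambiguity."""
    q_o = A @ q_s                                                    # predicted observations q(o | pi)
    cost = float(np.sum(q_o * (np.log(q_o + eps) - np.log(preferred_o + eps))))
    ambiguity = float(q_s @ (-np.sum(A * np.log(A + eps), axis=0)))  # E_q(s)[H[p(o|s)]]
    return cost + ambiguity

print(efe(np.array([0.9, 0.1])))   # belief concentrated on the preferred outcome: low G
print(efe(np.array([0.1, 0.9])))   # belief concentrated on the dispreferred outcome: high G
```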

26/49

5 An illustrative example

27/49

A simple example13

suppose an agent interacts with an environment

perception:

the environment has two hidden states s ∈ {1, 2}, e.g., there is food in your stomach (1) or not (2)

while s is NOT directly measurable, there is an observation o ∈ {1, 2}, e.g., you feel fed (1) or hungry (2)

assume p(o|s) is given as a 2×2 likelihood matrix (the “A” matrix) which maps states to observations, e.g., if you feel fed, you have food, and vice versa

control:

the transition probability p(s_t | s_{t−1}, u) maps the previous state to the next one, depending on the action u ∈ {u_1, u_2}

it is parameterised by a separate 2×2 transition matrix (the “B” matrix) for each action, e.g., either go get food (u_1) or do nothing (u_2); under u_1 you will have food in the next state, regardless of whether you have it now, and vice versa (see the numerical sketch of A and B below)
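A possible encoding of these matrices in NumPy is sketched below (the exact probabilities are invented; only the structure follows the description above). States and observations are 0-indexed, so state 0 means food in the stomach and observation 0 means feeling fed.

```python
import numpy as np

# "A" (likelihood) matrix: p(o | s), rows o, columns s.
# A slightly noisy mapping: if there is food (s = 0), you most likely feel fed (o = 0).
A = np.array([[0.9, 0.1],
              [0.1, 0.9]])

# One "B" (transition) matrix per action: p(s_t | s_{t-1}, u), rows s_t, columns s_{t-1}.
B = {
    "get_food":   np.array([[1.0, 1.0],    # u1: you end up with food regardless
                            [0.0, 0.0]]),
    "do_nothing": np.array([[0.0, 0.0],    # u2: you end up without food regardless
                            [1.0, 1.0]]),
}

# Each column of A and of every B is a probability distribution.
assert np.allclose(A.sum(axis=0), 1.0)
for b in B.values():
    assert np.allclose(b.sum(axis=0), 1.0)
```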

13https://medium.com/@solopchuk/tutorial-on-active-inference-30edcf50f5dc

28/49

A simple example14

preference (as a form of objective or reward)

we also have prior preferences p(o), e.g., we like to be fed and not hungry, so we assign a higher probability to the observation o = 1 (fed); in other words, we express preferences over observations as a probability distribution p(o)

14https://medium.com/@solopchuk/tutorial-on-active-inference-30edcf50f5dc

29/49

A simple example15

action selection:

option 1) q(s|π) → q(o|π) → G(π) → q(π) (see the numerical sketch below)

option 2) u*_{t+1} = arg max_u D_KL[A s_{t+1} || A B(u) s_t], the Bayesian model averaging method
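Option 1 can be sketched end to end for the one-step case: predict q(s|π) with the B matrix, map it to q(o|π) with the A matrix, score each policy by its expected free energy (expected cost plus expected ambiguity, as in the earlier decomposition), and take the softmax of −G. The matrices, preference and current belief below are hypothetical and only follow the structure of the example.

```python
import numpy as np

eps = 1e-12
A = np.array([[0.9, 0.1],                       # p(o | s), rows o, columns s
              [0.1, 0.9]])
B = {"get_food":   np.array([[1.0, 1.0], [0.0, 0.0]]),
     "do_nothing": np.array([[0.0, 0.0], [1.0, 1.0]])}
p_o = np.array([0.9, 0.1])                      # preference p(o): we prefer to feel fed (o = 0)
q_s = np.array([0.2, 0.8])                      # current belief: probably no food in the stomach

def G(action):
    """One-step expected free energy: expected cost + expected ambiguity."""
    q_s_next = B[action] @ q_s                  # q(s | pi): predicted next state
    q_o_next = A @ q_s_next                     # q(o | pi): predicted observation
    cost = float(np.sum(q_o_next * (np.log(q_o_next + eps) - np.log(p_o + eps))))
    ambiguity = float(q_s_next @ (-np.sum(A * np.log(A + eps), axis=0)))
    return cost + ambiguity

Gs = np.array([G(a) for a in B])
q_pi = np.exp(-Gs) / np.exp(-Gs).sum()          # q*(pi) = SoftMax(-G)
print(dict(zip(B, np.round(q_pi, 3))))          # most of the mass goes to "get_food"
```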

15https://medium.com/@solopchuk/tutorial-on-active-inference-30edcf50f5dc

30/49

Unifying perception and control (Bayesian brain)16

Figure: graphical models unifying perception and control over time. The bottom-up recognition model q(s) approximates the posterior p_θ(s | x, o), whose top-down generative model factorises as p_θ(x, o | s) p_θ(s); actions a_1, a_2, ... act on the environment, whose dynamics are governed by θ.

the latent state s: the true world configuration, such as a pixel assignment; the optimality o is a binary variable

the perception model includes a bottom-up recognition model q(s) and a top-down generative model p(x, o, s) (decomposed into the likelihood p(x, o|s) and the prior belief p(s))

the prior knowledge θ represents the physical law of the environment (the property of each object)

control is performed by taking an action a to change the environment state.

16Minne Li et al. “Joint Perception and Control as Inference with an Object-based Implementation”. In: (2020).

31/49

6 Control as inference

32/49

Bayesian decision principle

making decisions under uncertainty can be interpreted as maximising expected utility in the face of uncertainty17

the uncertainty is captured by the (hidden) state of nature: z ∈ Z

decisions (aka actions): a ∈ A

reward: R(z, a)

the posterior distribution p(z|o), where we can perform a statistical investigation to obtain information (denoted as o = (o_1, o_2, ..., o_n) ∈ O) about the state of nature z

(Conditional Bayes Decision Principle)

a*|o = arg max_a E_{z∼p(z|o)}[R(z, a)],   (1)

where this principle is regarded as the only fundamentally correct analysis18. Yet, the posterior can be difficult to obtain in practice
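When the state of nature, the action set and the posterior are all discrete and given, the principle in (1) is a one-line computation. The numbers below are hypothetical and only illustrate the mechanics.

```python
import numpy as np

p_z_given_o = np.array([0.8, 0.2])     # posterior over the state of nature z, given data o
R = np.array([[ 1.0, -1.0],            # R[z, a]: reward of action a when the state is z
              [-2.0,  3.0]])

expected_reward = p_z_given_o @ R       # E_{z ~ p(z|o)}[R(z, a)] for each action a
a_star = int(np.argmax(expected_reward))
print(expected_reward, a_star)          # choose the action with the highest posterior expected reward
```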

17John Von Neumann and Oskar Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, 1953.

18James O Berger. Statistical decision theory and Bayesian analysis. Springer Science & Business Media, 2013.

33/49

The duality between inference and control

the filtering problem is shown to be the dual of the noise-free regulator (control) problem19

Figure: Optimal filter. Figure: Optimal controller.

19Rudolph Emil Kalman. “A new approach to linear filtering and prediction problems”. In: (1960).

34/49

Control as inference:20

Figure: graphical models for control as inference, shown for (a) the case where the states s are hidden and (b) the case where the states s are observable. The backbone consists of states s_1, s_2, ... and actions a_1, a_2, ... with factors p(s_{t+1} | s_t, a_t); binary “optimality variables” O_t are then attached to each time step and play the role of observations in an HMM-style framework: conditioning on the optimality variables being true, we infer the most probable action sequence or action distributions. In the standard policy-search view, one instead seeks θ* = arg max_θ ∑_{t=1}^{T} E_{(s_t, a_t)∼p(s_t, a_t | θ)}[r(s_t, a_t)] under the trajectory distribution p(τ) = p(s_1) ∏_{t=1}^{T} π_θ(a_t | s_t) p(s_{t+1} | s_t, a_t); the graphical model is constructed so that inferring the posterior action conditional recovers the optimal policy (figure and accompanying text after Levine, 2018).

the idea: add an optimality variable O_t = 1 and assume a reward-biased probability: p(O_t = 1 | s_t, a_t) = exp(r(s_t, a_t))

20Sergey Levine. “Reinforcement learning and control as probabilistic inference: Tutorial and review”. In: arXiv preprint arXiv:1805.00909 (2018); Beren Millidge et al. “On the relationship between active inference and control as inference”. In: International Workshop on Active Inference. Springer, 2020, pp. 3–11.

35/49

Control as inference:21

let us look at an MDP model (states are observable)

the evidence: O_t = 1 for all t ∈ {1, ..., T}; thus the ELBO is

log p(O_{1:T}) = log ∫∫ p(O_{1:T}, s_{1:T}, a_{1:T}) ds_{1:T} da_{1:T}

= log ∫∫ p(O_{1:T}, s_{1:T}, a_{1:T}) (q(s_{1:T}, a_{1:T}) / q(s_{1:T}, a_{1:T})) ds_{1:T} da_{1:T}

= log E_{(s_{1:T}, a_{1:T})∼q(s_{1:T}, a_{1:T})}[p(O_{1:T}, s_{1:T}, a_{1:T}) / q(s_{1:T}, a_{1:T})]

≥ E_{(s_{1:T}, a_{1:T})∼q(s_{1:T}, a_{1:T})}[log p(O_{1:T}, s_{1:T}, a_{1:T}) − log q(s_{1:T}, a_{1:T})]

(the last step is Jensen's inequality)

21Sergey Levine. “Reinforcement learning and control as probabilistic inference: Tutorial and review”. In: arXiv preprint arXiv:1805.00909 (2018).

36/49

Control as inference:22

we make use of the generative model (the joint distribution)

p(O_{1:T}, s_{1:T}, a_{1:T}) = [p(s_1) ∏_{t=1}^{T} p(s_{t+1} | s_t, a_t)] exp(∑_{t=1}^{T} r(s_t, a_t))

and factorise the variational distribution as

q(τ) ≡ q(s_{1:T}, a_{1:T}) = q(s_1) ∏_{t=1}^{T} q(s_{t+1} | s_t, a_t) q(a_t | s_t)

this leads to the final ELBO (the maximum-entropy RL objective):

log p(O_{1:T}) ≥ E_{q(s_{1:T}, a_{1:T})}[∑_{t=1}^{T} r(s_t, a_t) − log q(a_t | s_t)]

where we define q(s_{t+1} | s_t, a_t) = p(s_{t+1} | s_t, a_t)

22Sergey Levine. “Reinforcement learning and control as probabilistic inference: Tutorial and review”. In: arXiv preprint arXiv:1805.00909 (2018).

37/49

Control as inference:23

without the temporal mean-field factorisation used in Active Inference, one can derive a (soft) value iteration similar to a standard RL solution

the ELBO can be decomposed recursively as:

E_{q(s_t, a_t)}[r(s_t, a_t) − log π(a_t | s_t)] + E_{q(s_t, a_t)}[E_{s_{t+1}∼p(s_{t+1} | s_t, a_t)}[V(s_{t+1})]]
= E_{q(s_t)}[−D_KL(π(a_t | s_t) || exp(Q(s_t, a_t) − V(s_t))) + V(s_t)]

where we define:

Q(s_t, a_t) ≡ r(s_t, a_t) + E_{s_{t+1}∼p(s_{t+1} | s_t, a_t)}[V(s_{t+1})]

V(s_t) ≡ log ∫_A exp(Q(s_t, a_t)) da_t

→ π(a_t | s_t) = exp(Q(s_t, a_t) − V(s_t))
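These recursions suggest a soft value iteration, sketched below for a small random tabular MDP. The transition kernel and rewards are invented, and a discount factor is added so that the infinite-horizon fixed-point iteration converges (the finite-horizon derivation above does not need one); the backup follows the definitions above, Q = r + E[V], V = log ∑ exp Q, π = exp(Q − V).

```python
import numpy as np
from scipy.special import logsumexp

nS, nA, gamma = 3, 2, 0.95
rng = np.random.default_rng(0)
r = rng.normal(size=(nS, nA))                    # r(s, a): made-up rewards
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # P[s, a] = p(s' | s, a)

V = np.zeros(nS)
for _ in range(200):                             # soft value iteration
    Q = r + gamma * (P @ V)                      # Q(s, a) = r(s, a) + gamma * E[V(s')]
    V = logsumexp(Q, axis=1)                     # V(s) = log sum_a exp Q(s, a)

pi = np.exp(Q - V[:, None])                      # pi(a | s) = exp(Q(s, a) - V(s))
print(np.round(pi, 3), pi.sum(axis=1))           # a valid stochastic policy: rows sum to 1
```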

23Sergey Levine. “Reinforcement learning and control as probabilistic inference: Tutorial and review”. In: arXiv preprint arXiv:1805.00909 (2018).

38/49

7 Multi-agent Variational Bayes

39/49

Agent modelling with maximum entropy objective24

each agent pursues the maximal cumulative reward

max η^i(π_θ) = E[∑_{t=1}^{∞} γ^t R^i(s_t, a^i_t, a^{−i}_t)],   (2)

with actions (a^i_t, a^{−i}_t) sampled from the policies (π^i_{θ^i}, π^{−i}_{θ^{−i}})

a strategy profile (π^{1*}, ..., π^{n*}) reaches an optimum when:

E_{s∼p_s, a^{i*}_t∼π^{i*}, a^{−i*}_t∼π^{−i*}}[∑_{t=1}^{∞} γ^t R^i(s_t, a^{i*}_t, a^{−i*}_t)]
≥ E_{s∼p_s, a^i_t∼π^i, a^{−i}_t∼π^{−i}}[∑_{t=1}^{∞} γ^t R^i(s_t, a^i_t, a^{−i}_t)]   (3)

∀π ∈ Π, i ∈ {1, ..., n},

where π = (π^i, π^{−i}) and agent i's optimal policy is π^{i*}

24Zheng Tian et al. “A regularized opponent model with maximum entropy objective”. In: IJCAI (2019).

40/49

Agent modelling with maximum entropy objective

a lower bound on the likelihood of optimality of agent i:

log P(O^i_{1:T} = 1 | O^{−i}_{1:T} = 1)
≥ ∑_t E_{(s_t, a^i_t, a^{−i}_t)∼q}[R^i(s_t, a^i_t, a^{−i}_t) + H(π(a^i_t | s_t, a^{−i}_t)) − D_KL(ρ(a^{−i}_t | s_t) || P(a^{−i}_t | s_t))]   (4)
= ∑_t E_{s_t}[ E_{a^i_t∼π, a^{−i}_t∼ρ}[R^i(s_t, a^i_t, a^{−i}_t) + H(π(a^i_t | s_t, a^{−i}_t))] (MEO) − E_{a^{−i}_t∼ρ}[D_KL(ρ(a^{−i}_t | s_t) || P(a^{−i}_t | s_t))] (Regulariser of ρ) ]   (5)

ρ(a^{−i}_t | s_t, o^{−i}_t = 1) is agent i's opponent model

π(a^i_t | s_t, a^{−i}_t, o^i_t = 1, o^{−i}_t = 1) is agent i's conditional policy at the optimum (o^i_t = o^{−i}_t = 1)

P(a^{−i}_t | s_t, o^{−i}_t = 1) is the prior of the opponent model; it is estimated empirically

for brevity, we drop (o^i_t = 1, o^{−i}_t = 1) in π, ρ and P(a^{−i}_t | s_t)

41/49

Recursive reasoning25

Example: in the “beauty contest” game, players are asked to pick numbers from 0 to 100, and the player whose number is closest to 2/3 of the average wins a prize
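A much simplified level-k reading of this game can be simulated in a few lines: a level-0 player guesses the midpoint, and each higher level best responds to the level just below it (the full cognitive-hierarchy model of Camerer et al. instead best responds to a Poisson mixture over all lower levels, so this sketch only conveys the flavour of the recursion).

```python
import numpy as np

def level_k_guesses(max_level, target_fraction=2/3, level0_guess=50.0):
    """Simplified level-k reasoning for the beauty contest game."""
    guesses = [level0_guess]                           # level-0: a naive guess at the midpoint
    for _ in range(max_level):
        guesses.append(target_fraction * guesses[-1])  # best response to the level below
    return guesses

print(np.round(level_k_guesses(5), 2))                 # 50.0, 33.33, 22.22, ... -> 0 as k grows
```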

25Colin F Camerer, Teck-Hua Ho, and Juin-Kuan Chong. “A cognitive hierarchy model of games”. In: The Quarterly Journal of Economics 119.3 (2004), pp. 861–898.

42/49

Multi-agent generalized recursive reasoning26

Figure: the level-k graphical models of generalised recursive reasoning. A level-0 policy π^i_0(a^i | s) conditions on the state alone; a level-1 policy π^i_1(a^i | s, a^{−i}) best responds to the level-0 opponent model ρ^{−i}_0(a^{−i} | s); in general, a level-k policy π^i_k(a^i | s, a^{−i}) best responds to the level-(k−1) opponent model ρ^{−i}_{k−1}(a^{−i} | s, a^i), recursing down through the lower-level reasoning to level 0.

unobserved opponent policies are approximated by ρ−i

agent i recursively reasons about opponents (grey area)

in the recursion, agents with higher-level beliefs take thebest response to the lower-level thinkers’ actions.

26Ying Wen et al. “Modelling Bounded Rationality in Multi-Agent Interactions by Generalized Recursive Reasoning”. In: IJCAI (2020).

43/49

8 Conclusions

44/49

Remarks

AI: reasoning under uncertainty

a learning system is an information system

pattern recognition and decision-making problems (including multi-agent learning) can be unified and modelled by probabilistic inference such as Variational Bayes (e.g., active inference or control as inference)

as such, information theory can be directly utilised

however, we call for new research on information theory that deals with “intrinsic information exchange”

45/49

References I

James O Berger. Statistical decision theory and Bayesian analysis. Springer Science & Business Media, 2013.

David M Blei, Alp Kucukelbir, and Jon D McAuliffe. “Variational inference: A review for statisticians”. In: Journal of the American Statistical Association 112.518 (2017), pp. 859–877.

Colin F Camerer, Teck-Hua Ho, and Juin-Kuan Chong. “A cognitive hierarchy model of games”. In: The Quarterly Journal of Economics 119.3 (2004), pp. 861–898.

46/49

References II

Karl Friston. “The free-energy principle: a unified brain theory?” In: Nature Reviews Neuroscience 11.2 (2010), pp. 127–138.

Michael I Jordan et al. “An introduction to variational methods for graphical models”. In: Machine Learning 37.2 (1999), pp. 183–233.

Rudolph Emil Kalman. “A new approach to linear filtering and prediction problems”. In: (1960).

Sergey Levine. “Reinforcement learning and control as probabilistic inference: Tutorial and review”. In: arXiv preprint arXiv:1805.00909 (2018).

47/49

References III

Minne Li et al. “Joint Perception and Control as Inference with an Object-based Implementation”. In: (2020).

David Marr. “Vision: A computational investigation into the human representation and processing of visual information”. In: Inc., New York, NY 2.4.2 (1982).

Beren Millidge et al. “On the relationship between active inference and control as inference”. In: International Workshop on Active Inference. Springer, 2020, pp. 3–11.

48/49

References IV

Beren Millidge, Alexander Tschantz, and Christopher L Buckley. “Whence the expected free energy?” In: Neural Computation 33.2 (2021), pp. 447–482.

Ivan Petrovitch Pavlov and William Gantt. “Lectures on conditioned reflexes: Twenty-five years of objective study of the higher nervous activity (behaviour) of animals.” In: (1928).

Eliezer Sternberg. NeuroLogic: The Brain’s Hidden Rationale Behind Our Irrational Behavior. Vintage, 2016.

Zheng Tian et al. “A regularized opponent model with maximum entropy objective”. In: IJCAI (2019).

49/49

References V

John Von Neumann and Oskar Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, 1953.

Ying Wen et al. “Modelling Bounded Rationality in Multi-Agent Interactions by Generalized Recursive Reasoning”. In: IJCAI (2020).