
Advances in Cognitive Systems X (20XX) 1-6 Submitted X/20XX; published X/20XX

Integrating Declarative Long-Term Memory Retrievals into Reinforcement Learning

Justin Li JUSTINNHLI@OXY.EDU
Computer Science, Occidental College, Los Angeles, CA 90041 USA

Abstract

The common model of cognition defines both declarative long-term memories (LTM) and reinforcement learning (RL) components, but their interaction remains relatively unexplored. We present a framework in which agents that follow the common model of cognition can learn to retrieve from LTM based only on task rewards, and use the resulting knowledge to select actions. This task- and knowledge-agnostic approach defines an internal RL problem where the agent's working memory is the state, with actions that allow the agent to query LTM. We use linear weights on features of predicate-object pairs to approximate the value function, speeding up learning when new entities are encountered. We show that our approach scales to an LTM that contains the complete contents of DBpedia, that learning is transferable to new percepts, and that it extends to sequences of queries and retrievals.

    1. Introduction

The recently proposed common model of cognition (Laird et al., 2017) represents a convergence of research in the cognitive systems and cognitive modeling communities. The common model describes the fixed mechanisms of a human-like mind, including components for the storage of both procedural and declarative knowledge, as well as the interfaces that define how they interact. The model also defines the means of learning in these components: where declarative learning occurs via the storage of new information, procedural learning occurs via reinforcement that adjusts action selection. While declarative long-term memory and reinforcement learning are core parts of the memory and learning systems in the model, research has tended to focus on each system individually, and rarely on how these two components interact.

This represents a significant gap in the literature. The vast majority of agents created in cognitive architectures have hard-coded rules for memory access, with no narrative for how those rules might be learned. These agent- and task-specific rules for memory retrieval are necessarily brittle, as they embed assumptions both about the task and about the representation of knowledge, and therefore fail to allow the agent to adapt to changes in the environment. Finally, the fixed patterns of memory access fail to explain how an agent might learn domain-independent higher-level memory strategies, such as rehearsal and knowledge search. These strategies build on top of the architectural memory mechanisms, and have been posited to be necessary for complex reasoning in general autonomous intelligent agents (Laird & Mohan, 2018). In order to model the complex memory processes in people (Burgess & Shallice, 1996), as well as to endow artificial agents with general memory capabilities, a generic method for learning to use memory is necessary.

In this paper, we take a step towards this goal by proposing a reinforcement learning framework that is capable of learning to retrieve knowledge from a large knowledge base in declarative long-term memory. Our framework is capable of generalizing across different entities in a knowledge base, and can adapt to different task domain rewards. To our knowledge, we are the first to investigate reinforcement learning over large knowledge bases in a cognitive architecture. The main contributions of our paper are:

• We present a task- and knowledge-agnostic system that is capable of retrieving knowledge from long-term memory, which can then be used for action selection, both via reinforcement learning.

• We show that our architecture is able to learn different patterns of long-term memory access that correspond to different graph structures in the knowledge base.

• We show that our architecture is able to learn even when the knowledge base has millions of entities and tens of millions of facts.

    2. Background

We are interested in the problem of how an agent might, in the context of a reinforcement learning task, bring in relevant knowledge from long-term memory for action selection. Crucially, it is the task reward that determines the appropriate knowledge to retrieve. In this section, we first describe the relevant components of the common model, before casting them into reinforcement learning and knowledge representation terms. We do this both for notational clarity, and to show that the system is agnostic to the contents of the knowledge base.

Any non-trivial artificial agent must represent its environment internally. In the common model of cognition, this representation is stored in working memory, which contains not only the agent's percepts, but also other knowledge that is immediately relevant to the situation. The contents of working memory represent the current state of the agent, and serve as the only source of knowledge for reasoning and decision making.

In addition to working memory, the agent also has a long-term memory (LTM), which contains all knowledge that may or may not be relevant at any given time. LTM is commonly further divided into autobiographical knowledge stored in episodic memory and world knowledge stored in semantic memory; this paper focuses on the latter. Knowledge in semantic memory may be vast, and previous work has included linguistic knowledge from WordNet and common sense knowledge from DBpedia (Salvucci, 2015; Li & Kohanyi, 2017). For clarity, we will use "knowledge base" (KB) to refer to a source of knowledge and "long-term memory" to refer to the component in an agent architecture, into which a KB may be loaded.

Since only working memory is usable for reasoning, LTM knowledge must first be copied into working memory. The common model of cognition defines two ways to access LTM. First, the agent could query or search LTM by creating a cue, which is a description of the desired knowledge. For example, a query might ask for "a rock band formed in 1965", which is then matched against the contents of LTM. A result of this search (e.g., Pink Floyd), as well as associated information about that entity (e.g., its country of origin), is then copied into working memory for agent use. Second, the agent is able to retrieve information about an entity that is already in working memory. We overload the term "retrieval" to mean either method of copying LTM knowledge into working memory.

Formally, the contents of LTM is an edge-labeled directed graph, defined by the tuple ⟨P, U, L, C⟩, where U is the set of entities; P the set of predicates; L the set of literals, which correspond to data such as numbers and strings; and C the contents of the KB, a set of triples ⟨U, P, U ∪ L⟩. For convenience, we refer to the three elements of a triple as the subject, the predicate, and the object. For example, the fact that Pink Floyd's album The Wall was released on November 30, 1979 is represented by the triple ⟨The_Wall, releaseDate, "1979-11-30"⟩. Since we will refer to the predicate-object pair of a triple frequently, we denote the objects as V = U ∪ L and call a pair ⟨p ∈ P, v ∈ V⟩ an attribute of the subject entity.

Within this formalism, the cue for querying LTM for knowledge is a set of attributes that describe the same entity. Assuming such an entity exists, all the attributes of that entity will be placed into working memory. Similarly, the retrieval of an entity would place all its attributes into working memory.
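As a concrete illustration of this formalism (a minimal sketch, not the paper's implementation; entities are named by plain strings here), a KB can be held as a set of triples, with a query matching a cue of attributes against every subject and a retrieval returning all attributes of an entity:

```python
from typing import NamedTuple, Set, Tuple, Union

Value = Union[str, int]                 # objects are entities (named by strings here) or literals

class Triple(NamedTuple):
    subject: str
    predicate: str
    object: Value

Attribute = Tuple[str, Value]           # a <predicate, object> pair

def attributes(kb: Set[Triple], entity: str) -> Set[Attribute]:
    """All attributes of an entity; a retrieval copies these into working memory."""
    return {(t.predicate, t.object) for t in kb if t.subject == entity}

def query(kb: Set[Triple], cue: Set[Attribute]) -> Set[str]:
    """Entities whose attributes include every attribute in the cue."""
    return {t.subject for t in kb if cue <= attributes(kb, t.subject)}

# The example triples from the text: The Wall was released on November 30, 1979.
kb = {
    Triple("The_Wall", "releaseDate", "1979-11-30"),
    Triple("The_Wall", "artist", "Pink_Floyd"),
    Triple("The_Wall", "label", "The Wall"),
}
assert query(kb, {("label", "The Wall")}) == {"The_Wall"}
assert ("artist", "Pink_Floyd") in attributes(kb, "The_Wall")
```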

Finally, we assume that the agent's task is a reinforcement learning problem, which is defined by the tuple ⟨S, A, T, R⟩, where S is the set of states, A the set of actions, T : S × A → S the transition function, and R : S × A → ℝ the reward function. Importantly for this work, we do not consider working memory or LTM as part of the state, but only the state of the environment. This allows us to define a task (e.g., finding an album at the record store) independently from the knowledge the agent can access (e.g., information about bands, genres, etc.). To make this distinction clear, we call the task the external environment, defined by the external states Sext, actions Aext, and so on. We define the corresponding internal environment, which includes actions for accessing LTM, in Section 5.

    3. Problem Definition

The underlying motivation for integrating LTM retrievals with RL is that knowledge in LTM may allow the agent to generalize and transfer its policy to new percepts. The knowledge retrieved from LTM serves as features that could reduce the state space, or that may de-alias partially-observable states. Since the use of RL with working memory is well-studied (Nason & Laird, 2005; Taatgen et al., 2005), the main challenge is in representing the interface between working memory and LTM, and in propagating the reward to queries and retrievals.

This problem of learning to retrieve from memory is difficult for several reasons. First, LTM may contain large amounts of knowledge — DBpedia, for example, contains information about 3M entities, and multiple attributes about each entity. It is therefore computationally intractable to consider the entire contents of LTM during action selection. Second, the agent does not know what information is relevant. While we assume that relevant knowledge exists in LTM, only a tiny subset of the KB is likely to be meaningful in any particular situation. Finally, creating cues for LTM presents a large discrete state and action space, which has traditionally been difficult for RL agents. Techniques that apply elsewhere are not applicable to the problem of retrieving from LTM, as the actions cannot be embedded in a higher-dimensional space (Dulac-Arnold et al., 2015). At the same time, the semantic structure of KBs suggests that generalizations should be possible, where the same predicates apply to multiple entities. These challenges require an efficient value function approximation that could ignore the majority of irrelevant knowledge in the KB while taking advantage of predicate semantics for generalization.

We make two simplifying assumptions in our approach. First, we assume that the external percepts are represented as sets of predicate-object pairs, and that they match their representation in LTM. This representation of percepts is in line with other cognitive architectures, and it allows us to sidestep the problem of symbol alignment with the KB, which we consider out of scope for this paper. Second, we assume that the appropriate knowledge to retrieve from LTM is fixed — that is, the same graph structure relates the environmental percept to the desired knowledge. This is again in line with how ACT-R and Soar agents are written, and it is hard to imagine a situation in which knowledge is necessary but is completely unrelated to the current percepts.

We are additionally concerned with two desiderata. First, the architecture should be agnostic to both the task and the knowledge base — it should not depend on particularities of the KB or the domain. In particular, the architecture should not require knowledge of the structure of the KB, or of what knowledge should be retrieved for the task. Second, the system should be consistent with how reinforcement learning is implemented in existing architectures such as ACT-R and Soar, as they are the foremost examples of the common model.

    4. Related Work

While there is work on using large KBs for cognitive modeling in the cognitive systems community (Douglass et al., 2009; Li & Kohanyi, 2017), the queries and retrievals are almost always hard-coded and not learned. The closest work to this paper is Gorski & Laird (2011), where the agent learned to use Soar's episodic memory, but retrieving knowledge from a large declarative KB presents additional challenges. First, KBs such as DBpedia are significantly larger and richer than the accumulated experiences of an agent in a limited domain. Where previous work looked at a domain that has on the order of twenty distinct symbols, common-sense KBs often contain millions of entities and even more relations about them. Second, the graphs of KBs present representational difficulties, as the agent must be able to encode the graphical structure, while prior work focused on propositional symbols. Finally, to the best of our knowledge, prior work has failed to generalize across different instances of a predicate, thereby negating the benefits of the semantics of a KB. Again, this poses challenges for the representation of the KB and the value function. This is in contrast to the flat propositional representation in the prior work.

While RL has been used for reasoning (Taylor et al., 2007) and relation learning in KBs (Das et al., 2018), using KBs in RL has received relatively little attention. Prior work on improving RL with external knowledge has focused on providing features (Bougie & Ichise, 2018; Bougie et al., 2018) and value advice (Daswani et al., 2014). Other work has integrated RL, symbolic and relational learning, and natural language processing in the same system, with an emphasis on online knowledge acquisition (Mitchell et al., 2015). None of these systems directly address the question of using knowledge to solve tasks through reinforcement.


It should be noted that KB feature extraction for RL is different from two other AI tasks, question answering and relation learning, even though all three tasks might apply RL to a KB. A key difference of this work from question answering is that questions such as "Who is the artist of The Wall?" are never explicitly posed — indeed, the agent may not know whether answering such a question is appropriate, as opposed to answering another question or not answering any questions at all. Rather, the agent is trying to maximize its reward, without regard to whether any particular question is answered correctly. Additionally, work in question answering tends to take a supervised learning approach, by using sequence-to-sequence translations of English questions to KB queries (Soru et al., 2017). Such training data is not generally available in an RL setting, and it is unclear that a similar translation from state sequences would be possible.

Similarly, while prior work has used RL for learning new relations (Xiong et al., 2017; Das et al., 2018), that approach still requires existing question-answer pairs (represented by the starting and ending entities), with the agent being rewarded for navigating to the correct answer. In contrast, no such pairs are provided to the agent in our work, and the agent is not rewarded for locating the answer in the KB. Instead, the reward depends entirely on the domain, independent of how the agent uses LTM.

    5. System Architecture

The general approach we take in framing LTM retrieval as a sequential decision problem is to create an internal problem space where the agent has actions for querying and retrieval. One approach that directly mirrors how current agents are written is to define each cue as an action. This approach is necessarily domain-specific, however, as the valid cues must be defined based on the task. The obvious alternative is to consider all possible attributes in LTM, and make them available as building blocks for cues. The size of these KBs makes this computationally intractable, however; DBpedia, for example, has over 1,300 distinct predicates and over 4M distinct literals. Presenting this many possible actions to the agent would make learning extremely slow.

Instead of learning to create the correct cue for retrieval, we reframe the problem as one of navigating LTM as an edge-labeled directed graph. By the assumptions laid out in Section 3, the environment percepts directly correspond to entities in LTM. The agent can then "move" along the incoming and outgoing edges of those entities, which correspond to querying and retrieving from LTM respectively. Querying also allows the agent to "jump" to more distant entities in LTM. The goal then is to "navigate" to the entity of interest, retrieve it into working memory, then use it for the selection of external actions. Unlike using arbitrary attributes to build a cue as suggested above, the navigation framing means that the agent only has a limited number of actions available: only the immediate attributes of an entity in working memory can be used for querying, which greatly reduces the action space.

Our approach to facilitate this is to create an internal reinforcement learning problem ⟨Sint, Aint, Tint, Rint⟩, which is heavily inspired by ACT-R's buffers and the PRIMs actions (Taatgen, 2013). The internal state Sint subsumes the external state Sext, with the internal actions Aint being a strict superset of the external actions Aext, with additional internal actions that do not affect the external state. The internal state must represent the results of queries and retrievals, and the internal actions must allow the agent to create queries from external and internal percepts. One of our key insights is that it is sufficient for the agent to manipulate attributes in an internal state, and that these manipulations map onto querying and retrieving from LTM. The results of retrievals and queries are themselves sets of attributes, which can then be used for additional retrievals and queries; more complex queries of multiple attributes can be constructed by adding one attribute at a time.

    We describe each of these components separately in the subsections below.

    5.1 Internal State

The functional purpose of the internal state is to represent the knowledge necessary for the agent to make decisions, and as such, it is equivalent to the working memory of the common model. Taking inspiration from ACT-R, the internal state is further divided into three buffers: the PERCEPTUAL buffer, the QUERY buffer, and the RESULT buffer. Each of these buffers contains a set of attributes, and they serve as interfaces to the environment and to LTM. The PERCEPTUAL buffer contains the external percepts, and its contents directly correspond to the external state. Since the PERCEPTUAL buffer is equivalent to the external state, the internal state is a strict superset of the external state. The QUERY buffer represents the query that the agent is executing in LTM, if any, while the RESULT buffer represents the result of that query or retrieval. Both of these buffers could remain empty if the agent is not executing a query or if there are no results from LTM.

Formally, an internal state is a set of tuples Sint = {⟨b, p, v⟩, ...}, where b ∈ B = {PERCEPTUAL, QUERY, RESULT} is one of the three buffers. Changes to the internal state therefore consist of adding and removing these tuples, which can be thought of as adding attributes to and removing attributes from any of the buffers. The details of the action space are described in the next section.

As an example, Table 1 shows part of the internal state of an agent tasked with determining the artist of The Wall. In this example, the only attribute in the QUERY buffer is ⟨label, "The Wall"⟩, which is the query issued to LTM. The attributes of the retrieved entity — the album The Wall — have been added to the RESULT buffer, and can be used by the agent to determine its next action. The mechanics of how LTM interacts with working memory are described in the next section; the external state and task in this example are described in Section 6.1.

Buffer       Predicate     Object
PERCEPTUAL   label         "The Wall"
             location      "Shelf A"
QUERY        label         "The Wall"
RESULT       subject       The_Wall
             label         "The Wall"
             artist        Pink_Floyd
             releaseDate   "1979-11-30"
             ...           ...

Table 1. An example of the internal state of the agent as it queries for the artist of The Wall.
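Written out in the formalism of the previous subsection, the Table 1 state is simply a set of ⟨buffer, predicate, object⟩ tuples (a sketch; unquoted-looking names such as "The_Wall" stand in for KB entities):

```python
# The internal state of Table 1 as a set of (buffer, predicate, object) tuples.
S_int = {
    ("PERCEPTUAL", "label", "The Wall"),
    ("PERCEPTUAL", "location", "Shelf A"),
    ("QUERY", "label", "The Wall"),
    ("RESULT", "subject", "The_Wall"),
    ("RESULT", "label", "The Wall"),
    ("RESULT", "artist", "Pink_Floyd"),
    ("RESULT", "releaseDate", "1979-11-30"),
    # ...additional attributes of The_Wall omitted
}
```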


    5.2 Internal Actions and Transitions

In order for the internal state space to interact with LTM, the internal actions must be able to retrieve the attributes of specific entities from LTM, as well as be able to query LTM with cues of one or more attributes. While retrieval is an atomic action, the building of query cues is more complicated. Taking inspiration from the PRIMs set of actions (Taatgen, 2013), we instead define actions that copy and delete attributes from buffers, with the query itself occurring automatically. With this in mind, the internal problem space has five parameterized actions:

• The copy action copies an attribute from one buffer to another. The action is parameterized by the source and destination buffers, as well as the attribute to copy, and is instantiated as copy(BUFSRC, pred, obj, BUFDST). When this action is taken, a new triple ⟨BUFDST, pred, obj⟩ is added to working memory.

• The delete action deletes an attribute from a particular buffer. The action is parameterized by the buffer and the attribute to delete, and is instantiated as delete(BUF, pred, obj). When this action is taken, the triple ⟨BUF, pred, obj⟩ is removed from working memory.

• The retrieve action retrieves all the attributes of an LTM entity into the RESULT buffer. The action is parameterized by the buffer and attribute, with the constraint that the object of the attribute must be an entity and not a literal. The action is instantiated as retrieve(BUF, pred, obj). When this action is taken, all triples in the RESULT buffer are removed, and all attributes of obj are added to the RESULT buffer.

• The prev-result action gets the previous result of a query. When this action is taken, the contents of the RESULT buffer are replaced with the attributes of the previous query result.

• The next-result action gets the next result of a query. When this action is taken, the contents of the RESULT buffer are replaced with the attributes of the next query result.

Note that while a retrieve action exists, there is no corresponding query action. Instead, a query happens automatically whenever the QUERY buffer is changed by the copy and delete actions, which modify the cue. Together with the prev-result and next-result actions, this set of actions allows the agent to query and retrieve from LTM, and to iterate over the results.

The various actions on the QUERY and RESULT buffers also affect each other. For example, if no entity in LTM matches the query cue, the RESULT buffer is cleared to indicate that no results were found; the same occurs if the agent deletes the last attribute in (that is, empties) the QUERY buffer. Similarly, the QUERY buffer is cleared when a retrieve action is taken, to avoid the situation of having cues that do not match the attributes in the RESULT buffer. These automatic changes to the buffers reduce the internal state space and thereby speed up learning.
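The transition dynamics described in this subsection can be sketched as follows (the helper and action function names are ours, not the paper's implementation; the prev-result and next-result actions are omitted for brevity):

```python
# A minimal sketch of the internal transitions: copy and delete edit buffer contents,
# a query runs automatically whenever the QUERY buffer changes, and retrieve replaces
# the RESULT buffer with an entity's attributes while clearing the QUERY buffer.
PERCEPTUAL, QUERY, RESULT = "PERCEPTUAL", "QUERY", "RESULT"

def attributes_of(ltm, entity):
    """All (predicate, object) attributes of an entity in the triple set ltm."""
    return {(p, o) for (s, p, o) in ltm if s == entity}

def run_query(ltm, cue):
    """Entities whose attributes include every attribute in the cue, in a fixed order."""
    subjects = {s for (s, _, _) in ltm}
    return sorted(s for s in subjects if cue <= attributes_of(ltm, s))

def set_result(state, attributes):
    """Replace the contents of the RESULT buffer with the given attributes."""
    state = {(b, p, v) for (b, p, v) in state if b != RESULT}
    return state | {(RESULT, p, v) for (p, v) in attributes}

def refresh_query(state, ltm):
    """Re-run the query after any change to the QUERY buffer (first match only here)."""
    cue = {(p, v) for (b, p, v) in state if b == QUERY}
    matches = run_query(ltm, cue) if cue else []
    return set_result(state, attributes_of(ltm, matches[0]) if matches else set())

def copy(state, ltm, src, pred, obj, dst):
    """Copy the attribute <pred, obj> from the source buffer to the destination buffer."""
    assert (src, pred, obj) in state, "attribute to copy must exist in the source buffer"
    state = state | {(dst, pred, obj)}
    return refresh_query(state, ltm) if dst == QUERY else state

def delete(state, ltm, buf, pred, obj):
    """Remove the attribute <pred, obj> from a buffer."""
    state = state - {(buf, pred, obj)}
    return refresh_query(state, ltm) if buf == QUERY else state

def retrieve(state, ltm, buf, pred, obj):
    """Clear the QUERY buffer and fill RESULT with the attributes of the entity obj."""
    state = {(b, p, v) for (b, p, v) in state if b != QUERY}
    return set_result(state, attributes_of(ltm, obj))
```

In this sketch the result of a query is chosen deterministically (the first match in a fixed ordering), matching the design choice discussed in Section 5.5; iterating over the remaining matches would be handled by prev-result and next-result.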

Figure 1. A portion of DBpedia around Pink_Floyd: dbr:The_Wall and dbr:The_Dark_Side_of_the_Moon are connected to dbr:Pink_Floyd by dbo:artist and carry rdfs:label, rdfs:firstLetter, dbo:releaseDate, and dbo:releaseDecade attributes; dbr:Pink_Floyd is connected by dbo:hometown to dbr:London, which is connected by dbo:country to dbr:United_Kingdom, each with rdfs:label and rdfs:firstLetter attributes.

As an example, consider how an agent would determine the name of the artist of The Wall. The PERCEPTUAL buffer contains {⟨label, "The Wall"⟩} (as illustrated in Table 1), and LTM contains a portion of DBpedia as shown in Figure 1. Retrieving the name "Pink Floyd" would require two actions:

1. copy the ⟨label, "The Wall"⟩ attribute from the PERCEPTUAL buffer to the QUERY buffer. Since the QUERY buffer has changed, the architecture would automatically query LTM for an entity that has the label "The Wall", and would find the entity The_Wall. The RESULT buffer would then be populated with the attributes of The_Wall, including the attribute ⟨artist, Pink_Floyd⟩.

2. retrieve the entity Pink_Floyd, which is the object of the artist predicate in the RESULT buffer. This will clear the QUERY buffer, and also populate the RESULT buffer with the attributes of Pink_Floyd, including the attribute ⟨label, "Pink Floyd"⟩.

With the string "Pink Floyd" now in working memory, the agent can use it for action selection.
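The two steps can be traced concretely against the Figure 1 fragment. The snippet below is a standalone sketch: the triples are a subset of Figure 1 (prefixes dropped), and the set comprehensions stand in for the architectural query and retrieve mechanisms described above.

```python
# A subset of the Figure 1 fragment as (subject, predicate, object) triples.
ltm = {
    ("The_Wall", "label", "The Wall"),
    ("The_Wall", "artist", "Pink_Floyd"),
    ("The_Wall", "releaseDate", "1979-11-30"),
    ("The_Wall", "releaseDecade", "1970-01-01"),
    ("The_Dark_Side_of_the_Moon", "label", "The Dark Side of the Moon"),
    ("The_Dark_Side_of_the_Moon", "artist", "Pink_Floyd"),
    ("Pink_Floyd", "label", "Pink Floyd"),
    ("Pink_Floyd", "hometown", "London"),
    ("London", "label", "London"),
    ("London", "country", "United_Kingdom"),
    ("United_Kingdom", "label", "United Kingdom"),
}

# Step 1: copying <label, "The Wall"> into QUERY triggers an automatic query.
matches = [s for (s, p, o) in ltm if (p, o) == ("label", "The Wall")]
result = {(p, o) for (s, p, o) in ltm if s == matches[0]}     # attributes of The_Wall

# Step 2: retrieve the object of the artist attribute found in RESULT.
artist = dict(result)["artist"]                               # "Pink_Floyd"
result = {(p, o) for (s, p, o) in ltm if s == artist}         # attributes of Pink_Floyd
print(dict(result)["label"])                                  # "Pink Floyd"
```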

We make several optimizations to reduce the size of the internal action space. First, the availability of the copy, delete, and retrieve actions depends on the contents of the internal state. For example, no retrieve actions are possible if none of the buffers contain an entity whose attributes could be retrieved, as would be the case if all attributes were literals. Similarly, no prev-result or next-result actions are possible if no query has been constructed, or if there is no next result. Since these actions are parameterized by entities that ultimately come from either perception or from LTM, the set of possible concrete actions will grow monotonically as the agent encounters more entities from the environment and from the KB, even while there may only be a few available actions for any particular internal state. The space of internal actions is therefore a function of the size of the KB.

We further limit the valid parameters for these actions to only those that are meaningful. Since the PERCEPTUAL and RESULT buffers represent the external environment and the KB respectively, directly modifying these buffers offers no benefit, and so they are invalid as targets for the copy and delete actions. Similarly, the contents of the QUERY buffer were themselves copied from another buffer, and therefore cannot serve as the source of a copy action. A final limit is on when internal actions are available at all, which we discuss in Section 5.5 below.

    5.3 Internal Rewards

Since the reward ultimately comes from the external task, the reward for internal actions is constrained by the domain. We make two assumptions about the internal reward function. First, we assume that the internal reward is negative, rint < 0, since a positive reward could lead to undesirable loops. Second, we assume that the internal reward is smaller in magnitude (closer to zero) than the external reward, which matches the intuition that accessing LTM is much less expensive than acting in the environment.


For internal actions to be learnable in the standard discounted-reward RL setting, the reward structure should discourage the agent from randomly trying external actions until one can be exploited. This implies that, at any internal state, the internal reward plus the discounted external reward must be greater than the external reward alone, such that internal actions are preferred. Formally, the following inequalities, with discount rate γ (where 0 < γ < 1) and internal and external rewards rint and rext, must hold:

rint + γ · E[rext] > E[rext]
0 > rint > (1 − γ) · E[rext]

These inequalities imply that the external reward must be negative, with the internal reward being a smaller negative value that is modulated by the discount rate. We note that a constant negative external reward, applied until the end of an episode, is commonly used to encourage reaching the goal in as few actions as possible. We therefore do not consider this constraint to be overly burdensome.
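As a quick check of these bounds with the values used later in Section 6 (a per-step external reward of -10, treated here as E[rext], and γ = 0.9), any internal reward between -1 and 0 satisfies the constraint, so the -0.1 internal reward used in our experiments qualifies. A minimal sketch of the check:

```python
# Check the internal-reward constraint 0 > r_int > (1 - gamma) * E[r_ext]
# using the Record Store values from Section 6 (gamma and rewards from the text).
gamma, r_ext, r_int = 0.9, -10.0, -0.1

lower_bound = (1 - gamma) * r_ext       # -1.0
assert lower_bound < r_int < 0          # -1.0 < -0.1 < 0, so the constraint holds
assert r_int + gamma * r_ext > r_ext    # equivalent form: -9.1 > -10.0
```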

    5.4 Value Function Approximation

Although the reinforcement learning problem defined by the states, actions, and rewards thus far is sufficient to propagate external rewards through long-term memory retrievals, it remains computationally intractable. The size of the internal state space is exponential in the size of the KB, since each buffer could contain each unique attribute. The parameterized internal actions further increase the complexity, making learning difficult for an agent that naively stores a Q-value with each state-action pair.

Complexity considerations aside, treating each state as distinct also negates the primary advantage of using KBs: that of the semantics encoded within the graph structure. The normalized predicates in a KB such as DBpedia mean that entities with the same predicate should have the same relation, and that the actions appropriate for one entity should generalize to another. In order to capture this generalization, a value function approximation is necessary.

Our approach is to use a linear value function approximation, which was chosen for its ability to include new features incrementally, by assuming that all unseen features have a weight of 0. Formally, the value of each state-action pair is approximated separately, with no shared weights, by the equation:

Q(s, a) = Σ_i w_{i,a} φ_i(s)

The features φ_i(s) of a given state are all the tuples of buffer and predicate; all tuples of buffer, predicate, and object; and a bias term. That is, a state {⟨b_1, p_1, v_1⟩, ..., ⟨b_n, p_n, v_n⟩} would have 2n + 1 features {⟨b_1, p_1⟩, ⟨b_1, p_1, v_1⟩, ⟨b_2, p_2⟩, ..., 1}, where the constant 1 is the bias term. To appropriately assign credit, the weight of each feature is normalized by dividing by the number of features. The weights are then updated by the standard Bellman update, parameterized by the learning rate α (with 0 < α < 1):

δ = r + γ max_{a_t} Q(s_t, a_t) − Q(s_{t−1}, a_{t−1})
w_{i,a} ← w_{i,a} + α δ φ_i(s_{t−1})

Note that since the features are the attributes in working memory, which could come from retrievals from LTM, the number of features ultimately scales with the size of the KB. Unlike a tabular value function, however, this scaling is merely quadratic and not exponential. Still, this represents a not insignificant amount of memory, so the next section presents additional considerations to further reduce the state space.
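A minimal sketch of this linear approximation, assuming the tuple-based state representation from Section 5.1 (the class and helper names are ours, and the placement of the feature normalization is one reading of the description above):

```python
from collections import defaultdict

def features(state):
    """Features of a state: (buffer, predicate) and (buffer, predicate, object) tuples, plus a bias."""
    feats = {"bias"}
    for (b, p, v) in state:
        feats.add((b, p))
        feats.add((b, p, v))
    return feats

class LinearQ:
    """Linear Q-value approximation with per-action weights; unseen features start at 0."""

    def __init__(self, alpha=0.1, gamma=0.9):
        self.alpha = alpha
        self.gamma = gamma
        self.weights = defaultdict(float)   # keyed by (feature, action)

    def value(self, state, action):
        feats = features(state)
        # Each feature's contribution is normalized by the number of active features.
        return sum(self.weights[(f, action)] for f in feats) / len(feats)

    def update(self, prev_state, prev_action, reward, state, actions):
        """One Bellman update: delta = r + gamma * max_a Q(s, a) - Q(s_prev, a_prev)."""
        best = max((self.value(state, a) for a in actions), default=0.0)
        delta = reward + self.gamma * best - self.value(prev_state, prev_action)
        feats = features(prev_state)
        for f in feats:
            self.weights[(f, prev_action)] += self.alpha * delta / len(feats)
```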

    5.5 Other Design Considerations

A final design decision is to prevent the agent from endlessly changing its internal state while ignoring the external environment. We do so by setting a limit on the number of consecutive internal actions that the agent could take, as well as a threshold on the cumulative negative reward of an episode, below which the episode will end. Forcing the agent to take high-penalty external actions (and bringing it closer to the threshold) effectively enforces a depth limit on how much of the KB the agent could explore. To be clear, the depth limit is not the number of consecutive internal actions, but the number of internal actions the agent can take per external action, given the negative reward threshold. In theory, the depth limit is calculated by ⌊lt / (l·rint + rext)⌋, where l is the internal action limit and t is the threshold. In practice, agents rarely reach this limit, as they are likely to select an external action before reaching the maximum consecutive internal action limit. As long as the internal action limit is set such that the desired entity is reachable, the effective depth limit only affects learning speed and memory use, but not the optimal policy. A secondary consequence of this limit is that it bounds the number of unique attributes that are retrieved from LTM, thus reducing the memory needed for the value function approximation.
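As a worked example of the formula only (the threshold value is not reported in the paper and is assumed here), plugging in the Record Store rewards from Section 6 with an internal action limit of l = 4 and an assumed threshold of t = -100:

```python
from math import floor

# Depth-limit formula from above: floor(l*t / (l*r_int + r_ext)).
# r_int and r_ext are the Record Store rewards (Section 6.1); l and t are
# illustrative values assumed here (the paper does not report its threshold).
l, t = 4, -100.0
r_int, r_ext = -0.1, -10.0
depth = floor(l * t / (l * r_int + r_ext))
print(depth)   # 38
```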

We note that the internal RL problem as described above is compatible with both ACT-R and Soar. The system of buffers was inspired by ACT-R, and while Soar's semantic memory allows for graph-structured cues, it could also emulate the flat structure described here. Reinforcement learning in both ACT-R and Soar occurs by assigning values to production rules, with the conditions of those rules being the features. The main difficulty of implementing this system is generating rules as new features are added, when the agent encounters or retrieves new knowledge. While this is not currently possible within ACT-R and Soar, we suspect this is not due to conflicts with core architectural assumptions, but because reinforcement learning has not been attempted at this scale. Thus, while this system is not implementable within ACT-R and Soar as they currently stand, we believe it still satisfies both of the desiderata listed in Section 3.

As a final note in comparing this system to ACT-R and Soar, many cognitive architectures use metadata such as base-level activation to rank results from LTM. While these memory mechanisms may be generally useful, they make learning to use memory harder, as they introduce a non-Markov element. If activation is used, querying with the same cue in the same working memory state could lead to different results at different times, as different activation values in LTM could change which entity is retrieved. For this reason, in this paper the knowledge to return is chosen deterministically, based only on the cues. Where multiple matching symbols exist, the agent could use the prev-result and next-result actions to iterate through possible results instead.

6. Experimental Validation

To demonstrate and evaluate this framework, we show results from two experiments in a Record Store domain. The goals are twofold:

    • To show that the agent is able to transfer learned behaviors to new percepts

• To show that the agent is able to learn increasingly complex sequences of queries and retrievals from LTM

Note that these experiments do not correspond to any cognitive task, but only demonstrate the ability of the RL system to learn to query and retrieve from LTM. Due to the above-mentioned difficulty of dynamically creating features in ACT-R and Soar, the experiments described in this section were done in Python. The source code is available at https://github.com/justinnhli/2020acs.

    6.1 The Record Store Domain

As in the running example throughout this paper, this domain presents the agent with some information about an album, and tasks the agent with finding it on the shelves of a record store. Importantly, the information given to the agent does not correspond to the organization of the store: while the agent may be asked to pick up The Wall, the shelves may instead be organized by the decade of album release. Instead, the agent must access LTM for additional information about the album, and use the results to determine which shelf to go to. Since the focus of this paper is on the interface between the agent and LTM, and not on advancing fundamental RL algorithms, we deliberately chose a fully observable and deterministic domain. Observability here refers to the environment, which corresponds to our definition of Sext, and not to the full contents of the KB.

In every episode, the agent is presented with information about one album. The external state in this domain contains only the album information (e.g., the title) and the agent's location in the store. The agent has access to the complete contents of DBpedia in this domain, with ~24.1M triples. We used a version of DBpedia where all predicates have been mapped to the DBpedia ontology, meaning that the ~1,300 predicates are reused throughout the KB. Additionally, to reflect how record stores often arrange albums chronologically by decade and alphabetically by artist, we augment DBpedia with two pieces of information. All entities with the predicate releaseDate also get a releaseDecade predicate, with an appropriately-modified literal. Similarly, all entities with the predicate label are also augmented with a firstInitial predicate.
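A sketch of this augmentation step, assuming ISO-formatted releaseDate literals and the derived literal formats shown in Figure 1 (the augment function is our own illustration, not the released code):

```python
def augment(triples):
    """Add releaseDecade and firstInitial triples derived from releaseDate and label."""
    derived = set()
    for (s, p, o) in triples:
        if p == "releaseDate":
            decade = (int(o[:4]) // 10) * 10
            derived.add((s, "releaseDecade", f"{decade}-01-01"))   # e.g. "1970-01-01"
        elif p == "label":
            derived.add((s, "firstInitial", o[:1]))                # e.g. "T" for "The Wall"
    return triples | derived
```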

Each external action in this domain (moving to a shelf) results in a reward of -10, unless the shelf contains the desired album, in which case the agent receives a reward of 0 and the episode ends. The number of external actions available depends on the metadata by which the shelves are organized (e.g., the first letter of the name of the artist), with one action corresponding to each category. Each internal action results in a reward of -0.1, which satisfies the constraints described in Section 5.3.
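A minimal sketch of the external Record Store task under these rules (the RecordStore class, its method names, and the shelf_of mapping are our own simplifications, not the released code):

```python
import random

class RecordStore:
    """External Record Store task: move between shelves until the album's shelf is found."""

    def __init__(self, albums, shelf_of):
        self.albums = albums          # list of album entities to present, one per episode
        self.shelf_of = shelf_of      # maps an album to the shelf it is filed under
        self.album = None
        self.location = None

    def reset(self):
        """Start an episode by presenting one album; the agent starts at no shelf."""
        self.album = random.choice(self.albums)
        self.location = None
        return {("label", self.album)}          # external percept as a set of attributes

    def step(self, shelf):
        """Move to a shelf: reward 0 and end the episode if it holds the album, else -10."""
        self.location = shelf
        if shelf == self.shelf_of(self.album):
            return 0.0, True
        return -10.0, False
```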


    6.2 Validation Experiment

This experiment serves as a basic validation of our framework and attempts to answer several questions. First, is our representation of internal states, actions, and linear value function approximation sufficient for the agent to converge to the optimal policy? Second, assuming convergence, how does the learning speed compare to a naive agent that simply uses trial and error to learn the correct action? That is, how does our framework compare to a simple tabular agent for the external RL problem? Third, can the internal action limit effectively serve to control the exploration of the agent, so that large portions of LTM can be ignored? And finally, is the agent capable of transferring action selection knowledge to previously-unseen entities?

In this experiment, the agent is given the title of an album, and must navigate to the shelf for the decade in which the album was released — information it can obtain via a single query. To explore the question of transfer, we use two disjoint sets of 5,000 albums, out of about 15,000 albums that exist in DBpedia. For the first 150,000 episodes, only the first set of albums will be presented, allowing time for the agent to learn the optimal policy. After 150,000 episodes, only the second set of albums will be used instead. The use of 150,000 episodes was determined from a pilot experiment as sufficient for agent learning to converge. Note that although the album entities and titles will be different, the sequence of actions required to retrieve information from LTM remains the same.

We look at three different agents: a simple tabular Q-learner, and two agents that use our framework with access to the complete contents of DBpedia through LTM, which we label the LTM agents. These two agents differ only by their limit on the number of consecutive internal actions; we label these agents LTM-4 (with an internal action limit of 4) and LTM-1 (with a limit of 1). All agents use a learning rate of 0.1 and a discount rate of 0.9.

Figure 2. Performance in the Record Store domain by different agents, where the store organizes albums by decade. A new set of albums is presented at episode 150,000.

The overall results are shown in Figure 2. The x-axis shows the number of episodes into training, and the y-axis shows the reward for that episode. The solid line shows the performance of the tabular Q-learner, and the dashed and dotted lines show the performance of the LTM-4 and LTM-1 agents respectively. All three agents are able to learn the optimal policy, which suggests the two LTM agents have learned to query for an album with the given title, and to use the release decade in the result for action selection. This optimal policy is confirmed by examining the learned weights, where the weights for the external actions are dominated by the feature that describes the decade in which an album is released. However, the tabular agent converges more quickly to the optimal policy, despite the generalizations afforded by the value function approximation. This is not surprising, given that the internal environment is a superset of the external environment, and thus has a larger state and action space. Even with the amount of knowledge available in DBpedia, however, the LTM agents do not require significantly more episodes to learn the optimal policy. This suggests that the features for the value function approximation allow for efficient learning.

Examining the results more closely, the LTM-4 agent is slower to converge than the LTM-1 agent. In fact, although the LTM-4 agent learns over time, its learning has not reached convergence before episode 150,000, when new albums are introduced. This confirms our hypothesis that the internal action limit is an effective control for agent learning. With a lower limit, the LTM-1 agent has fewer opportunities to retrieve from or query LTM, which in turn reduces the number of distinct entities that it encounters. In effect, this reduces the number of weights the agent has to learn, since each new predicate and attribute requires a new linear weight. This also has the effect of reducing the amount of memory required by the agent.

Finally, the performance of the agents is clearly affected at episode 150,000, when a new set of albums is presented. At this point, the performance of the tabular agent drops to initial levels, which is expected, since the new entities do not match any entries in its value function table, and the agent must learn from scratch. In contrast, we only see a moderate drop in performance for the two LTM agents, with the slight decrease occurring as the agent learns the weights for the new entities. We attribute this behavior to the value function approximation. By learning that the predicate, but not the predicate-object pair, is informative of which action to take, the agent is able to transfer its knowledge of how to use LTM to new entities it has not encountered before.

In summary, this experiment shows that our representation is sufficient for agents to learn an optimal policy for using LTM, that it does not slow down learning dramatically, that learning can be controlled with an internal action limit parameter, and that agents can transfer their LTM strategy to new environmental percepts.

    6.3 Retrieval Complexity Experiments

This second set of experiments is designed to show that the architecture is able to learn more complex retrievals from LTM. Within the Record Store domain, we require more complex retrievals by altering the "organization" by which albums are placed. We use three variants of the Record Store domain:

1. Given an album title, find the shelf corresponding to the decade of the album's release. This only requires one query using the album title, and is the same as the task in the first experiment.

2. Given an album title, find the shelf corresponding to the first letter of the artist. This requires one query using the album title, followed by a retrieval on the artist.

3. Given an album title, find the shelf corresponding to the country of origin of the artist. This requires one query using the album title, followed by three successive retrievals on the artist, their hometown, and finally the country in which that town is located.


Figure 1 illustrates how these sequences of queries and retrievals correspond to traversing further along the semantic network. For this experiment, the agent is tasked with finding 100 albums, and is allowed to learn until its policies converge.

    Figure 3. Performance in the Record Store domain with different retrieval sequences.

The overall results are shown in Figure 3. In all three cases, the agent learns the optimal policy of first retrieving the relevant knowledge from LTM, before moving to the appropriate shelf. As expected, the larger the graph distance between the external percept and the desired entities, the longer it takes for agent learning to converge, as a longer sequence of actions is required. Nonetheless, these results suggest that the architecture can be applied to tasks and KBs that require different and more complex features.

    7. General Discussion

In this paper, we presented a framework that can learn to retrieve from LTM based on external task rewards. We provided evidence that our approach of using internal actions with a linear value function approximation allows the agent to converge to the optimal policy for different patterns of accessing LTM, and to transfer the memory access strategy to new entities. Other agent parameters effectively limit the depth of exploration of the LTM, allowing our approach to work even with KBs with millions of triples. This work is therefore a step towards more tightly integrating the declarative memory and reinforcement learning components defined in the common model of cognition.

This work opens the door to several areas of future work. First, the capabilities of the current approach are not isomorphic to the capabilities of ACT-R or Soar, as those architectures have additional features for LTM access, such as prohibiting a particular attribute from matching. We also have not tested our approach with an additional buffer for the intermediate storage of retrieved entities, which would be necessary for any query patterns that require graph matching; the imaginal buffer in ACT-R serves exactly this purpose. ACT-R and Soar can also perform additional reasoning with the results from LTM, which was not the focus of this paper. While we do not have empirical evidence of performance, we do not see any major obstacles to combining our approach with reasoning and additional LTM mechanisms.

Second, one of the motivations for this work was the possibility for agents to learn higher-order memory strategies such as rehearsal. Many of these strategies interact with subsymbolic metadata, such as base-level activation, which we have set aside in this work. More generally, human memory retrieval is highly strategic, with multiple processes for problem solving, elaboration, and verification (Burgess & Shallice, 1996). Similarly, metamemory phenomena such as the feeling of knowing may be explainable as additional features for deciding whether to query LTM. By allowing task rewards to drive the pattern of LTM access, future work can explore tasks under which these memory strategies and mechanisms are beneficial.

    References

Bougie, N., Cheng, L. K., & Ichise, R. (2018). Combining deep reinforcement learning with prior knowledge and reasoning. Applied Computing Review, 18, 33–45.

Bougie, N., & Ichise, R. (2018). Deep reinforcement learning boosted by external knowledge. Proceedings of the 33rd Annual ACM Symposium on Applied Computing (pp. 331–338). ACM.

Burgess, P. W., & Shallice, T. (1996). Confabulation and the control of recollection. Memory, 4, 359–412.

Das, R., Dhuliawala, S., Zaheer, M., Vilnis, L., Durugkar, I., Krishnamurthy, A., Smola, A., & McCallum, A. (2018). Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning. Proceedings of the 2018 International Conference on Learning Representations.

Daswani, M., Sunehag, P., & Hutter, M. (2014). Reinforcement learning with value advice. Proceedings of the 6th Asian Conference on Machine Learning (pp. 299–314).

Douglass, S. A., Ball, J., & Rodgers, S. (2009). Large declarative memories in ACT-R. Proceedings of the 9th International Conference on Cognitive Modeling (ICCM).

Dulac-Arnold, G., et al. (2015). Reinforcement learning in large discrete action spaces. From https://arxiv.org/abs/1512.07679.

Gorski, N. A., & Laird, J. E. (2011). Learning to use episodic memory. Cognitive Systems Research, 12, 144–153.

Laird, J., & Mohan, S. (2018). Learning fast and slow: Levels of learning in general autonomous intelligent agents. Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI).

Laird, J. E., Lebiere, C., & Rosenbloom, P. S. (2017). A standard model of the mind: Toward a common computational framework across artificial intelligence, cognitive science, neuroscience, and robotics. AI Magazine, 38, 13–26.

Li, J., & Kohanyi, E. (2017). Towards modeling false memory with computational knowledge bases. Topics in Cognitive Science (TopiCS), 9, 102–116.

Mitchell, T., et al. (2015). Never-ending learning. Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI).

Nason, S., & Laird, J. E. (2005). Soar-RL: Integrating reinforcement learning with Soar. Cognitive Systems Research, 6, 51–59.

Salvucci, D. D. (2015). Endowing a cognitive architecture with world knowledge. Proceedings of the 37th Annual Conference of the Cognitive Science Society (CogSci).

Soru, T., Marx, E., Moussallem, D., Publio, G., Valdestilhas, A., Esteves, D., & Neto, C. B. (2017). SPARQL as a foreign language. Proceedings of the Posters and Demos Track of the 13th International Conference on Semantic Systems (SEMANTiCS).

Taatgen, N., Lebiere, C., & Anderson, J. (2005). Modeling paradigms in ACT-R. In R. Sun (Ed.), Cognition and multi-agent interaction: From cognitive modeling to social simulation, 29–52. Cambridge University Press.

Taatgen, N. A. (2013). The nature and transfer of cognitive skills. Psychological Review.

Taylor, M. E., Matuszek, C., Smith, P. R., & Witbrock, M. (2007). Guiding inference with policy search reinforcement learning. Proceedings of the 20th International Florida Artificial Intelligence Research Society Conference (FLAIRS).

Xiong, W., Hoang, T., & Wang, W. Y. (2017). DeepPath: A reinforcement learning method for knowledge graph reasoning. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 564–573). Association for Computational Linguistics.
