geometric concept acquisition by deep reinforcement learning · 2017-09-25 · geometric concept...

Geometric Concept Acquisition by Deep Reinforcement Learning

Alex Kuefler
CS 331B: Representation Learning in Computer Vision

Stanford [email protected]

Abstract

Explaining how intelligent systems come to embody knowledge of deductive concepts through inductive learning is a fundamental challenge of both cognitive science and AI. This project addresses this challenge by exploring how deep reinforcement learning agents, occupying settings similar to those of early-stage mathematical concept learners, come to represent ideas such as rotation and translation. I first train a Dueling Deep Q-Network on a shape sorting task requiring implicit knowledge of geometric properties, then transfer its learned parameters to tasks querying explicit geometric knowledge, such as classification. Furthermore, I conduct a number of qualitative experiments that offer intuitions as to how shape identity is encoded in the network.

1. Introduction

Mathematical concepts are unique in that they are taken to be formally definable and deductively true, given a set of axioms. In contrast, most human mental representations, such as visual categories and language concepts, resist precise definitions [1] and only come to be known after considerable inductive experience. Similarly, deep neural networks have achieved human-level or state-of-the-art performance on tasks such as object recognition [2], game playing [3, 4], and speech generation [5] by learning distributed (rather than symbolic) representations through inductive (rather than deductive) training. The empirical success of human learners and artificial neural networks contrasts sharply with the description of mathematical concepts as abstract, formal, and universal ideals. This work is both a proof of concept and an exploratory analysis. I demonstrate that a domain-agnostic learning algorithm trained on a task comparable to child’s play is able to represent geometric concepts which are only implicitly coded in its training task’s structure.

In particular, I develop a simulated shape sorting toy similar to those enjoyed by toddlers. Such a learning environment requires an agent to carry out sequences of actions and reason about properties of 2D shapes to obtain reward. I train a variant of the popular Deep Q-Network (DQN) to expertise in this task, then evaluate its representation of shape categories both quantitatively and qualitatively.

Figure 1. Example frames from the shape sorting environment. A blue cursor indicates grabbing; a green cursor indicates not grabbing.

From a cognitive science perspective, shape sorting is a minimal, yet realistic activity in early-stage mathematical reasoning. Although some past work has assessed shape sorting ability at different age groups [6], a goal of this project is to demonstrate that deep learning may provide a computational paradigm for building on psychological theory and generating new hypotheses about geometric concept acquisition.

From an artificial intelligence (AI) perspective, this work also assesses the merits of reinforcement learning (RL) as a family of algorithms to drive acquisition of generalizable representations. Although RL is less studied than supervised [7] and unsupervised [?] approaches, it features two main properties which may be advantageous for knowledge transfer. First, RL is sequential and may be used to uncover temporally extended patterns (e.g., transformations like rotation and translation) in feedforward architectures. Second, RL is weakly supervised, which may drive representation learning for families of concepts related to, but not explicitly measured by, an objective function.

2. Previous Work

Blocks worlds have a long history in both AI and cognitive psychology. SHRDLU [8], one of the earliest instances, relied on the simple, discrete semantics of the setup in order to explore natural language understanding in an instruction-following system. Later work used rule-based learning systems to reason about various structures that could be formed in such an idealized world [9].

Within the domain of cognitive psychology, blocks worlds are often used to study and model intuitive physics [10, 11], that is, how people reason about properties such as balance and gravitational forces acting on 3D objects. Although such environments, like the one I present, feature a finite set of discrete entities adhering to certain rules of interaction, their broad properties and questions of investigation tend to differ. For instance, these simulations emulate properties of the physical world, like velocity. I seek to understand abstract properties, such as shape and congruence, and thus make use of an environment with few physical constraints. In this respect, the present work may have more in common with theorem solvers [12, 13] or early geometric reasoning programs [14] than traditional blocks worlds. A more recent analogue would be GeoS [15], a machine learning based system for solving SAT geometry problems given diagrams and natural language descriptions. However, GeoS is instantiated as a multi-staged pipeline consisting of many modules, without learning its representations from raw inputs.

Some work in robotics also makes use of a similar task formulation. The "peg and hole" problem typically involves manipulating pegs and bringing them into alignment with corresponding holes. In this domain, intelligent systems have the additional constraint of responding to tactile feedback and may focus more on low-level control behaviors [16, 17] than planning or decision making. The curious robot of [18] is, however, much more closely aligned with the motivations of this project, in the sense that it views representation learning as an active process, driven by information-seeking actions taken by an agent. However, their model is trained through supervised means, rather than RL.

It should also be noted that although the present work does not deal directly with human subject data, its motivations are consistent with a number of studies probing deep learning systems for human-like representations. Comparisons have been made between attentional policies [19], similarity judgements between visual categories [20], and category learning [21] in humans and neural networks.

3. Approach

This work frames geometric concept acquisition as an RL problem in a simplified 2D environment. As such, the experimental setup can be decomposed into an environment and an agent.

3.1. Environment

During the initial training stage, the neural network model interacts with a simulated shape sorting toy (shown in Figure 1), which may be interpreted as a finite-horizon Markov Decision Process (MDP) with deterministic transitions and high-dimensional states. Environment interactions are divided into trials, which consist of at most 500 timesteps. At timestep $t$ an image $x_t \in \mathbb{R}^{84 \times 84}$ is emitted and depicts some combination of two types of objects: blocks and holes. Every object in $x_t$ is characterized by a scalar angle and position vector, which remain fixed in holes, but are subject to change in blocks. Each object is also characterized by a convex, 2D shape drawn from the set $S$, which includes squares, trapezoids, equilateral triangles, right triangles, and hexagons. Once a shape is chosen for an object, it is held constant throughout the course of a trial. $x_t$ also displays an effector, or "hand", used by an agent to manipulate blocks’ positions and orientations.

The initial frame $x_0$ includes three blocks whose shape assignments $S_b$ are drawn uniformly with replacement from $S$, with randomized positions and orientations. Four holes with random orientations are also given shape assignments $S_h$ drawn without replacement from $S$. A constraint $S_b \subset S_h$ ensures that no block will be generated without a corresponding hole. The positions for holes are also randomized each trial, but only drawn from four possible, equidistant positions. This second constraint ensures that holes never overlap. In contrast, blocks may overlap (sometimes completely) but are manipulable, and can be disentangled by an agent.

Given $x_t$, an agent responds with action $a_t$, which is drawn from a set of seven discrete choices: up, down, left, and right movements, toggle grab, and rotate clockwise or counterclockwise. Rotations are 30° and a single cardinal movement covers 10% of the height or width of $x_t$. If the grab is active, blocks will "stick" to the hand, changing position and orientation as the effector does. If the grab is inactive, movement and rotation actions bear no influence on blocks.
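The action set described above can be sketched as a small state update on the effector. This is only an illustration, not the environment's actual PyGame code: the `Hand` class, the `apply_action` function, the action names, the sign convention for clockwise rotation, and the clamping at the frame border are all assumptions introduced here.

```python
from dataclasses import dataclass

# The seven discrete actions named in the text; the identifiers are illustrative.
ACTIONS = ("up", "down", "left", "right", "toggle_grab", "rotate_cw", "rotate_ccw")
FRAME_SIZE = 84                 # frames are 84 x 84 pixels
MOVE_STEP = 0.10 * FRAME_SIZE   # one cardinal move covers 10% of the frame
ROTATE_STEP = 30                # degrees per rotation action


@dataclass
class Hand:
    """Hypothetical effector state: position, orientation, and grab flag."""
    x: float
    y: float
    angle: int = 0              # degrees, kept in [0, 360)
    grabbing: bool = False


def apply_action(hand: Hand, action: str) -> Hand:
    """Return the effector state after one action (screen y grows downward)."""
    x, y, angle, grabbing = hand.x, hand.y, hand.angle, hand.grabbing
    if action == "up":
        y -= MOVE_STEP
    elif action == "down":
        y += MOVE_STEP
    elif action == "left":
        x -= MOVE_STEP
    elif action == "right":
        x += MOVE_STEP
    elif action == "toggle_grab":
        grabbing = not grabbing
    elif action == "rotate_cw":
        angle = (angle + ROTATE_STEP) % 360   # assumed sign convention
    elif action == "rotate_ccw":
        angle = (angle - ROTATE_STEP) % 360
    else:
        raise ValueError(f"unknown action: {action}")
    # Assumption: the effector is clamped to the frame at the border.
    x = min(max(x, 0.0), float(FRAME_SIZE))
    y = min(max(y, 0.0), float(FRAME_SIZE))
    return Hand(x, y, angle, grabbing)
```

In a fuller sketch, any block held while `grabbing` is true would receive the same position and angle deltas as the hand, which is how a grabbed block would "stick" to the effector.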

The environment uses a handcrafted reward function which assigns a small penalty of −0.001 when the effector contacts the border of the screen, a small reward of +0.001 when the effector grabs a block, a large reward of +10.0 when the effector fits a block to a corresponding hole, and a very large reward of +50.0 when all blocks have been fitted to holes. A fit occurs when the effector "releases" a block over a hole, and the block’s vertex set is contained by the hole’s vertex set. If a fit occurs, the block disappears and will not return for the remainder of the trial.

Figure 2. The Dueling Deep Q-Network architecture. Image from [25].
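As a rough illustration of the reward schedule, the function below maps the four events named in the text to their reward values. The function name, the boolean event flags, and the choice to sum rewards when several events coincide in one timestep are assumptions; the text does not say how simultaneous events are combined.

```python
BORDER_PENALTY = -0.001   # effector touches the screen border
GRAB_REWARD = 0.001       # effector grabs a block
FIT_REWARD = 10.0         # a block is fitted to a corresponding hole
ALL_FIT_REWARD = 50.0     # every block has been fitted


def step_reward(hit_border: bool, grabbed_block: bool,
                fitted_block: bool, all_fitted: bool) -> float:
    """Sum the rewards for the events that occurred in one timestep.

    Assumption: coinciding events are additive, so the final fit of a trial
    earns both the per-block and the completion reward.
    """
    total = 0.0
    if hit_border:
        total += BORDER_PENALTY
    if grabbed_block:
        total += GRAB_REWARD
    if fitted_block:
        total += FIT_REWARD
    if all_fitted:
        total += ALL_FIT_REWARD
    return total
```

The four orders of magnitude between the shaping terms (±0.001) and the fit rewards suggest that the shaping terms are meant to guide early exploration without competing with the task objective.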

The environment was developed using PyGame and is inspired by OpenAI Gym’s interface.

3.2. Agent

The goal of this work is to evaluate the interpretability and transferability of representations learned through RL to tasks requiring explicit judgments about geometry. As such, a relevant learning agent must be (1) deep, or capable of expressing multiple, hierarchical representations that could feasibly embody geometric invariants, given raw pixels, (2) psychologically plausible, or sufficiently similar to animal decision making to suggest research directions for cognitive science, and (3) powerful enough to solve the non-trivial MDP described in section 3.1. For these reasons, a Deep Q-Network (DQN) [?] was used to perform the shape sorting task.

DQN has attained state-of-the-art results on similar tasks which include discrete action spaces and high-dimensional state spaces. Sharing many architectural properties with convolutional neural networks, it learns a succession of hidden representations which can be visualized and interpreted as an abstraction hierarchy [22]. Furthermore, DQN features some desiderata as a model of animal decision making. A prioritized replay pool has been compared to hippocampal learning mechanisms [23], and the architecture is trained using temporal difference (TD) learning, which has been shown to underpin some forms of animal learning [24].

TD learning is here accomplished by minimizing the adaptive loss function

$$L_i(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim D}\left[\left(y_i - Q(s, a; \theta_i)\right)^2\right] \quad (1)$$

with target value

$$y_i = r + \gamma \max_{a'} Q(s', a'; \theta_i^-) \quad (2)$$

State-action-reward-state tuples $(s, a, r, s')$ observed during environment interactions are drawn from a replay buffer $D$ and used as training samples. $Q(s, a)$ represents the sum of discounted future rewards if action $a$ is taken from state $s$ and is estimated at epoch $i$ by a DQN parameterized by $\theta_i$. Updating the model using estimates from a target network parameterized by $\theta_i^-$ has been shown to improve the stability of training. A policy can then be induced from the DQN using an ε-greedy rule, selecting actions maximizing $Q(s, a)$ with probability $1 - \epsilon$ and otherwise selecting exploratory, random actions.
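To make Equations 1 and 2 concrete, here is a minimal sketch of the TD target and the squared TD error over a sampled batch. It is written in plain Python rather than the TensorFlow used in this project, and the names `td_target` and `td_loss`, the callable Q-functions, and the `terminal` flag (no bootstrapping at the end of a trial) are assumptions made for illustration.

```python
def td_target(reward, next_q_values, gamma, terminal=False):
    """Equation 2: y = r + gamma * max_a' Q(s', a'; theta^-).

    next_q_values holds the target network's estimates for every action in s'.
    Assumption: at the end of a trial the target is the reward alone.
    """
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


def td_loss(batch, q_online, q_target, gamma):
    """Equation 1: mean squared TD error over transitions from the replay buffer.

    batch is a list of (s, a, r, s_next, terminal) tuples; q_online and
    q_target are callables mapping a state to a list of per-action Q-values.
    """
    squared_errors = []
    for s, a, r, s_next, terminal in batch:
        y = td_target(r, q_target(s_next), gamma, terminal)
        squared_errors.append((y - q_online(s)[a]) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Toy usage with two states, two actions, and lookup tables standing in for networks.
online_table = {0: [1.0, 2.0], 1: [0.0, 0.5]}
target_table = {0: [1.0, 1.5], 1: [0.0, 1.0]}
loss = td_loss([(0, 1, 1.0, 1, False)],
               q_online=lambda s: online_table[s],
               q_target=lambda s: target_table[s],
               gamma=0.5)
```

In the toy call the target is 1.0 + 0.5 × 1.0 = 1.5 and the online estimate is 2.0, so the loss is 0.25. Keeping `q_target` separate from `q_online` mirrors the role of $\theta_i^-$: the target stays fixed while the online parameters are updated.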

In this work, $Q(s, a)$ is represented by a Dueling Deep Q-Network (DDQN), which is subject to the same TD learning paradigm as DQN, but features a different architecture [25]. DDQN follows a convolutional network with two disjoint, fully-connected streams that represent the scalar value of a state $V(s)$ and the "advantage" $A(s, a) = Q(s, a) - V(s)$ of a state-action pair separately. These representations are then merged with the broadcasting rule

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta, \alpha) \quad (3)$$

where $\theta$, $\alpha$, and $\beta$ parameterize the convolutional, advantage, and value subnetworks respectively, and $|\mathcal{A}|$ is the number of actions. DDQN has been shown to improve the state of the art beyond the performance of DQN and is a natural choice for this project, due to the representational expressiveness introduced by the additional value and advantage layers.
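The merging rule in Equation 3 amounts to a few lines of arithmetic. The sketch below assumes the two streams have already produced a scalar value and a list of per-action advantages; the function name `dueling_q` is hypothetical and not taken from the adapted source code.

```python
def dueling_q(value, advantages):
    """Equation 3: Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a').

    value is the scalar output of the value stream; advantages holds one
    entry per action from the advantage stream.
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [value + adv - mean_advantage for adv in advantages]


# Toy usage with three actions.
q_values = dueling_q(2.0, [1.0, 0.0, -1.0])   # [3.0, 2.0, 1.0]
```

Subtracting the mean advantage leaves the ranking of actions unchanged while forcing the Q-values to average to $V(s)$, which keeps the value and advantage streams from trading a constant offset between them.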

The DDQN used in this study is implemented in TensorFlow and adapted from source code found at [26].

4. Reinforcement Learning Results

The DDQN was trained to complete the environment task described in section 3.1 over the course of 1 week on a single GeForce GTX 980 graphics processing unit. Results are shown in Figure 3. Most notably, the agent was unable to complete a single game (fit all blocks) before the end of the time horizon until about halfway through training, at which point it started to improve. By the last epoch, the agent was fitting 7× the number of games into a single training cycle. Reward still seems to be increasing at the end of training, implying that given more time and computational resources, this network architecture may have achieved even greater results on the task.

5. Experiments

This work investigates the hypothesis that although the network is trained with no supervision signal explicitly denoting the existence of shape, its hidden representations will nevertheless capture the shape identity of the manipulable blocks. Furthermore, I anticipate that interaction through the shape sorting environment will allow the model to represent concepts such as rotational similarity. I perform two quantitative analyses using linear classification and independent t-tests to assess the model, and provide further intuitions using visualization techniques.

Figure 3. RL results. Top plot shows average reward over all time-steps in an epoch, whereas middle plot is average turn on episodes. Bottom plot shows number of games completed each epoch.

5.1. Data

The labelled training set used in both experiments borrows elements from the shape sorting environment. It includes 15,000 training and validation examples, where the $i$th example features an orientation $\theta_i$, position $p_i$, and convex shape $s_i$ drawn from set $S$ and visualized in image $x_i$. Unlike the observations drawn during RL, these training images feature only a single block object each (i.e., no holes, effector, or borders are visible). Because the DDQN was trained on concatenated frames (allowing it to view partial histories), the labelled images were tiled over 4 channels, allowing them to retain the same dimensionality as shape sorter states. The trained DDQN architecture was then applied to each $x_i$ as a fixed feature extractor, producing an encoding $z_i^j$, or activation vector over layer $j$.

As such, each training pair is treated as a tuple $(x_i, z_i^j, \theta_i, p_i, s_i)$.

5.2. Shape Classification

The first experiment measures the interpretability of shape categories, as encoded by each $z_i^j$. Within the context of computational neuroscience, linear classifiers have been used to decode information about categorical stimuli from neural responses [27]. I adopt a similar approach. Intuitively, because neural networks are universal function approximators [28], a well-tuned network’s learned representations should be discriminable up to an affine transformation. As such, I assess the classification accuracy of three linear models trained on encodings from different layers.

Figure 4. Example images from the single shape dataset. Note that DDQN inputs are bitmaps: the colors of the shape dataset are chosen so as to indicate that the shapes are "Blocks" rather than "Holes".

It could also be argued that shape representations in the network depend heavily on scene context. For example, when a scene contains multiple blocks it may not be useful to encode any information about shape identity until the effector has taken hold of a single block, as only then must it make a choice contingent on the identity of the shape. To test this hypothesis, I repeat the discrimination experiment for three conditions: (1) no effector is present, (2) the effector is placed over the shape, but is not "grabbing", and (3) the effector is grabbing the shape. As a baseline, I repeat the same analysis on encodings produced by an untrained, randomly initialized neural network with the same architecture.

Results shown in Table 1 appear to support the hypothesis that shape information is scene dependent, albeit slightly. At least at the level of FA and FV, classification accuracy was consistently greater by about 5-9% when the effector was over the block. However, the results also disconfirm the hypothesis that shape information is predominantly encoded in the higher layers. Whereas the FA and FV encodings for each classifier obtain only about 30-40% validation accuracy, the encodings of the earlier convolutional layers C1 and C2 are the most discriminable, achieving closer to 70% accuracy. Given that the shape objects are fairly primitive, this result may be unsurprising. Because C1 encodes edge information, it is likely easy to reconstruct shape identity given only a combination of C1’s features. This outcome leaves open the question of what information is encoded in later layers, if not shape identity. This question is addressed in the following sections.


Validation Results

Condition     Layer   LDA                   Softmax               SVM
No Cursor     FA      30.93;27.90(3.03)     31.10;27.83(3.27)     34.57;39.10(-4.53)
No Cursor     FV      32.43;30.13(2.30)     32.80;29.73(3.07)     34.20;39.40(-5.20)
No Cursor     C3      42.83;40.37(2.47)     42.30;39.13(3.17)     43.87;64.77(-20.90)
No Cursor     C2      71.50;57.07(14.43)    68.53;52.67(15.87)    74.00;70.57(3.43)
No Cursor     C1      74.73;67.40(7.33)     72.37;66.40(5.97)     47.03;41.47(5.57)
Cursor Over   FA      39.30;26.47(12.83)    38.87;26.30(12.57)    40.27;42.57(-2.30)
Cursor Over   FV      38.60;27.23(11.37)    37.87;26.90(10.97)    39.03;43.33(-4.30)
Cursor Over   C3      65.50;36.37(29.13)    63.60;34.27(29.33)    65.97;67.20(-1.23)
Cursor Over   C2      70.40;54.73(15.67)    69.60;46.83(22.77)    73.67;71.07(2.60)
Cursor Over   C1      69.37;63.37(6.00)     70.30;60.30(10.00)    49.43;40.97(8.47)
Cursor Grab   FA      35.83;28.47(7.37)     35.03;28.47(6.57)     37.23;41.00(-3.77)
Cursor Grab   FV      35.87;29.50(6.37)     35.43;29.50(5.93)     36.47;42.10(-5.63)
Cursor Grab   C3      62.97;35.47(27.50)    61.10;33.50(27.60)    62.93;65.57(-2.63)
Cursor Grab   C2      72.27;50.17(22.10)    70.57;43.63(26.93)    73.67;69.53(4.13)
Cursor Grab   C1      65.27;60.23(5.03)     66.97;58.30(8.67)     49.40;43.43(5.97)

Table 1. Validation accuracy (%) on 5-way shape classification from trained (left) and randomized (right) DDQN encodings using linear discriminant analysis (LDA), softmax regression, and a support vector machine (SVM). Each cell reads trained;random, with the difference between accuracies in parentheses. FA and FV are hidden layer activations associated with advantage and value outputs. Ci is the ith convolutional layer.

It must also be noted that encodings given by the random network underperformed the trained network in all conditions, except those in which discrimination was done by the support vector machine (SVM). However, these encodings also tended to perform above chance or, in some cases, on par with the trained network. Their surprisingly positive results are likely due to hierarchical properties of convolutional neural networks, explored in other work [29].

5.3. Single Cell Responses

Although the classifier study gives an impression of how the network distributes information across a vector of activations, we may also wish to know how single neurons respond to shape inputs. Given the five shape classes, I also perform a series of independent t-tests for every neuron in the network with respect to every possible pair of shape inputs (of which there are 10). We can then consider the "response count" of a single neuron to be the number of pairs of shapes for which the test was statistically significant after Bonferroni correction ($p < 0.05/10 = 0.005$).

Table 2 shows the response counts for all the neurons in the network, averaged across each layer. In the case of randomly initialized weights, this average response count (ARC) seems simply to increase with the depth of the network. However, the DDQN actively discards shape information in the value hidden layer when the cursor is visible, given that every shape eventually yields the same amount of value once grabbed. However, because some shapes are harder to fit than others, we may be seeing an increased ARC in FV when no cursor is present. Intuitively, once the network has a block, it knows it will receive a certain amount of reward, so the neurons in FV are less discriminative than those in FA, which are selecting between future actions. However, when no effector is present, and no future actions need to be evaluated, the trend reverses. A game screen featuring blocks with lower orders of symmetry (thus requiring more rotations) will necessarily yield less reward because of the discount factor. So FV neurons must be more shape sensitive in these circumstances.

Average Layer Responsiveness

Condition     Layer   DDQN   Random
No Cursor     FA      0.95   1.65
No Cursor     FV      1.03   1.58
No Cursor     C3      0.41   1.35
No Cursor     C2      1.33   1.11
No Cursor     C1      0.92   1.25
Cursor Over   FA      2.05   1.65
Cursor Over   FV      1.94   1.58
Cursor Over   C3      2.06   1.35
Cursor Over   C2      1.36   1.11
Cursor Over   C1      0.59   1.25
Cursor Grab   FA      1.58   1.67
Cursor Grab   FV      1.42   1.49
Cursor Grab   C3      1.49   1.32
Cursor Grab   C2      1.20   0.96
Cursor Grab   C1      0.51   0.82

Table 2. Response count averaged over every neuron in each layer.
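The response count can be sketched as follows, assuming the pairwise p-values have already been computed by some t-test routine (not shown). The function names, the dictionary layout keyed by shape pair, and the shape labels are illustrative assumptions.

```python
from itertools import combinations

SHAPES = ("square", "trapezoid", "equilateral triangle", "right triangle", "hexagon")
PAIRS = list(combinations(SHAPES, 2))   # the 10 unordered pairs of shape classes
ALPHA = 0.05 / len(PAIRS)               # corrected threshold, 0.05 / 10


def response_count(p_values):
    """Number of shape pairs on which one neuron's t-test is significant.

    p_values maps each pair in PAIRS to the p-value of an independent t-test
    comparing that neuron's activations on the two shapes.
    """
    return sum(1 for pair in PAIRS if p_values[pair] < ALPHA)


def average_response_count(layer_p_values):
    """Average response count (ARC) over all neurons in one layer."""
    counts = [response_count(p) for p in layer_p_values]
    return sum(counts) / len(counts)
```

A response count therefore ranges from 0 (the neuron separates no pair of shapes) to 10 (it separates every pair), and the entries in Table 2 are layer-wise averages of this quantity.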

5.4. Projected Representations

It is natural to ask: if the deeper layers are coding less information about shape, what information do they code? Here I use principal component analysis (PCA) to uncover the two directions of greatest variance in the encodings and plot a sample of 1,000 data-points. Given that the "No Cursor" condition performed both the best (at layer C2) and the worst (at the fully connected layers) according to the classification results, I have chosen this condition for visualization.

Figure 5 displays the principal projections of each layer from the trained neural network. Colorizing each data-point according to its position on the input frame reveals a strong coherence in the encodings. In contrast, colorizing each data-point according to its corresponding shape identity does not. This trend is consistent for every layer of the network, and becomes more pronounced deeper into the network. Notably, activations at layer FA appear to collapse back into the center, whereas FV activations remain ring-like. This collapse is likely caused by the fact that, in the environment, the positions of holes are randomized, but fall within a few canonical positions around the edges. Therefore, images featuring blocks closer to the edges will tend to have a greater value. In contrast, FA is forced to learn information about the advantage function, which is agnostic to state value, and therefore less sensitive to the position of the blocks.

Figure 5. Encodings from DDQN projected onto first principal components. (Left) colorization by shape position; (right) colorization by shape identity.

Figure 6 shows the same dataset encoded by an untrained network. Interestingly, activations at C1 appear almost identical to the trained network, and here too encodings separate based on their position deeper into the convolutional architecture (but not as cleanly). However, this spatial information is mostly lost by the fully connected layers. These results indicate that the topographic organization of CNNs naturally retains spatial information, but to perform the shape sorting task, the DDQN has exerted additional optimization effort to ensure this spatial information is available for state valuation, and to a lesser extent, computing advantages.

Figure 6. Encodings from random network projected onto first principal components. (Left) colorization by shape position; (right) colorization by shape identity.

Given that the position of the shapes accounts for so much of the variance in the encodings, we might ask how representations look when position information is removed. A second dataset was constructed where shape identity and angle were varied, but the position remained fixed in the center of the frame. PCA results for this dataset are shown in Figure 7 and compared against random orthogonal projections discovered by the QR decomposition. Here the encodings do appear to be broadly clustered around shape identity. Even at the level of C1, hexagons (dark blue) form the most coherent group. Early in processing, this coherence may be caused by the fact that the hexagon has the most obtuse angles of any shape and appears almost round at the 84x84 resolution. Therefore, very different edge detectors will be activated by the hexagon than by the other, sharper shapes. Interestingly, this coherence is maintained deeper into the network, where squares (light blue) also differentiate into a clear grouping. Hexagons and squares have the highest orders of symmetry of any shape in the dataset. Given that the environment’s rotational step size is set at 30°, it takes 2 and 3 steps, respectively, to turn the blocks to an identical orientation. As such, we would expect them to be evaluated differently than other shapes at FA, because their presence reliably implies that the network must perform fewer actions to obtain reward.

Figure 7. Projections of centered dataset. Classes are trapezoid (red), right triangle (green), hexagon (dark blue), equilateral triangle (yellow), square (light blue).
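The step counts quoted above follow from each shape's order of rotational symmetry and the 30° step size. The sketch below works this out for all five shape classes. It assumes a regular hexagon, and it assumes that the trapezoid and right triangle have rotational symmetry of order 1 (no rotation short of 360° maps them onto themselves); the text only says that these two shapes have lower orders of symmetry. The function name is hypothetical.

```python
ROTATE_STEP = 30  # degrees per rotation action in the environment

# Order of rotational symmetry per shape. The values for the trapezoid and
# right triangle are assumptions (order 1); the others follow from regularity.
SYMMETRY_ORDER = {
    "hexagon": 6,
    "square": 4,
    "equilateral triangle": 3,
    "trapezoid": 1,
    "right triangle": 1,
}


def steps_to_identical_orientation(order, step=ROTATE_STEP):
    """Fewest rotation actions that return a shape to an identical orientation."""
    period = 360 // order          # smallest rotation mapping the shape onto itself
    steps = 1
    while (steps * step) % period != 0:
        steps += 1
    return steps


STEPS = {shape: steps_to_identical_orientation(order)
         for shape, order in SYMMETRY_ORDER.items()}
```

Under these assumptions the hexagon needs 2 steps and the square 3, matching the text, while the equilateral triangle needs 4 and the order-1 shapes need a full 12. Fewer distinguishable orientations means fewer rotation actions are needed, on average, before a block can be fitted.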

Equilateral triangles (yellow) and trapezoids (red) are tightly grouped together as they are distinguishable from one another by only a few pixels (the "top" tip of the equilateral triangle). Representations of the right triangle (green) are also quite similar to these shapes’ encodings, due to the right triangle’s visual similarity. However, its low order of symmetry ensures it can be placed at non-canonical angles, accounting for its amount of spreading away from the other classes.

Visualizations were also made using t-SNE [30], but were generally found to be less informative than those presented here.

5.5. Preferred Stimuli

Projecting encodings into 2D gives a good impression of how shapes are distributed with respect to one another, and throughout the representational space. However, the approaches from the previous section yield little intuition about how shapes are modeled by the network individually. One visualization technique [31] addresses this problem by computing a "class model visualization", or an image $I$ maximizing the score $S_c$ for a given class $c$,

$$\arg\max_I \; S_c(I) - \lambda \|I\|_2^2 \quad (4)$$

If $S_c$ is computed by a differentiable architecture, such as a neural network, Equation 4 is realized by backpropagating into an input image while the weights of the trained network are held fixed. Intuitively, this should result in an image that best exemplifies $c$, or maximally excites the output neurons computing the class score.

Here I perform softmax regression on the outputs of layer C2 to compute $S_c$ for $c = 1, ..., 5$, as this layer achieved the highest discriminability according to the linear classifier experiments. The softmax classifier was trained on encodings from a small subset (50 examples) of the centered shape dataset, so that the features of the resulting images are localized in the center of the frame. Equation 4 was regularized with $\lambda = 0.01$ and optimized using Adam gradient descent [32] for 1,000 iterations.
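The optimization in Equation 4 can be illustrated on a toy problem. The sketch below deliberately swaps in two simplifications that are not the setup used here: the class score is a plain linear function $S_c(I) = w \cdot I$ of the pixels rather than a softmax score over C2 encodings, and the image is updated by plain gradient ascent rather than Adam. The function name, the learning rate, and the zero initialization are assumptions.

```python
def class_model(weights, lam=0.01, learning_rate=1.0, iterations=1000):
    """Gradient ascent on S_c(I) - lam * ||I||_2^2 for a linear score S_c(I) = w . I.

    The image I is the quantity being optimized; the weights stay fixed,
    mirroring backpropagation into the input rather than into the parameters.
    """
    image = [0.0] * len(weights)
    for _ in range(iterations):
        # Gradient of the objective with respect to each pixel: w - 2 * lam * I
        gradient = [w - 2.0 * lam * pixel for w, pixel in zip(weights, image)]
        image = [pixel + learning_rate * g for pixel, g in zip(image, gradient)]
    return image


# Toy usage on a three-pixel "image".
toy_weights = [1.0, -2.0, 0.5]
toy_image = class_model(toy_weights)
```

For this linear score the objective has a closed-form maximizer, $I^* = w / (2\lambda)$, so with $\lambda = 0.01$ the toy image approaches 50 times the weight vector. The regularizer is what keeps the maximizing image bounded; with a deep network in place of the linear score, no closed form exists and the iterative procedure is required.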

Figure 8 shows the resulting class models. Although the classifier was trained with images from the No Cursor condition, the maximizing images nevertheless tend to feature a distinct circle (the shape of the cursor) in the center of most frames. This result indicates that the network best encodes shape information when the shapes are currently under control of the agent. Importantly, this effect is strong enough to be reflected in the class score even though the classifier itself sees no examples of an effector.

The class models also appear to become more textured in the most recent channels. Recall that, even though the environment observations are in $\mathbb{R}^{84 \times 84}$, DDQN training involves stacking a history of past observations in the network’s input space. The frames on the left side of Figure 8 are further from the current time-step, and are smoother, having received less optimization effort than the more recent time-steps. In other words, it appears that the network mostly relies on the current frame for evaluating a shape’s identity.

Figure 8. Images maximizing class score for each shape. Each column represents a different input channel in grayscale.

Another notable effect is that the distinctiveness of the imagined effector is more consistent over time for the shapes with lower orders of symmetry (trapezoid, right triangle). For shapes with higher orders of symmetry (hexagon, equilateral triangle, square), the cursor does not appear until the most recent frame. One explanation is that because higher order shapes are easier to fit, the network spends fewer timesteps observing them gripped by the effector. So the features of the effector are more tightly coupled to the network’s representation of the less symmetric shapes.

6. Conclusions

Learning mechanisms and computational principles underlying mathematical cognition are not well understood. However, deep neural networks, able to integrate information from many modalities needed to reason mathematically (i.e., images, natural language, symbols), provide exciting opportunities for exploring this field of inquiry. Here I have hypothesized that RL, which incorporates active probing of an environment, serves as a sufficient training signal for learning many interesting properties of geometric abstractions embodied only implicitly in the training task. In particular, we saw that shape identity can be recovered from the network’s hidden layers using linear classifiers and that this information is more strongly encoded in later convolutional layers than in the final hidden layers needed to valuate states and possible actions.

Many points also indicate that the symmetry properties of certain shapes, which affect the difficulty of achieving a fit, constrain the internal representations of the network. In particular, higher order shapes are more identifiable from the hidden layers of the network and lower order shapes, based on class model visualizations, seem to be strongly associated with the agent’s effector. However, these results were largely uncovered through exploratory analysis, and can be further validated through future work. In particular, extending the shape sorting environment with a larger class of shape options that vary smoothly in increasing order of symmetry may be beneficial.

Follow-up studies may also investigate how increasingly psychologically plausible, RL-based reward signals such as curiosity [33] or auxiliary goals [34] affect representation learning. Ultimately, mathematical cognition is a complex, distinctively human behavior that involves both abstract representation and motivation, in the form of intrinsic reward and a teaching context in which to do well. Hopefully this work can provide some early intuitions into how such concepts can be embodied in a distributed system that learns actively from raw visual inputs.

7. Acknowledgements

I would like to thank James McClelland and Mykel Kochenderfer for useful recommendations and conversations regarding this project.

References[1] A. Biletzki and A. Matar. Ludwig wittgenstein. Stan-

ford Encyclopedia of Philosophy, 2014.

[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and JianSun. Deep residual learning for image recognition.arXiv preprint arXiv:1512.03385, 2015.

[3] Volodymyr Mnih, Koray Kavukcuoglu, David Sil-ver, Andrei A Rusu, Joel Veness, Marc G Bellemare,Alex Graves, Martin Riedmiller, Andreas K Fidjeland,Georg Ostrovski, et al. Human-level control throughdeep reinforcement learning. Nature, 518(7540):529–533, 2015.

[4] David Silver, Aja Huang, Chris J Maddison, ArthurGuez, Laurent Sifre, George Van Den Driessche, Ju-lian Schrittwieser, Ioannis Antonoglou, Veda Panneer-shelvam, Marc Lanctot, et al. Mastering the game of

8

go with deep neural networks and tree search. Nature,529(7587):484–489, 2016.

[5] Aaron van den Oord, Sander Dieleman, HeigaZen, Karen Simonyan, Oriol Vinyals, Alex Graves,Nal Kalchbrenner, Andrew Senior, and KorayKavukcuoglu. Wavenet: A generative model for rawaudio. arXiv preprint arXiv:1609.03499, 2016.

[6] William F Burger and J Michael Shaughnessy. Char-acterizing the van hiele levels of development in ge-ometry. Journal for research in mathematics educa-tion, pages 31–48, 1986.

[7] Pulkit Agrawal, Ross Girshick, and Jitendra Malik. Analyzing the performance of multilayer neural networks for object recognition. In European Conference on Computer Vision, pages 329–344. Springer, 2014.

[8] Terry Winograd. Understanding natural language. Cognitive psychology, 3(1):1–191, 1972.

[9] Patrick H Winston. Learning structural descriptions from examples. 1970.

[10] Jessica Hamrick, Peter Battaglia, and Joshua B Tenenbaum. Internal physics models guide probabilistic judgments about object dynamics. In Proceedings of the 33rd annual conference of the cognitive science society, pages 1545–1550. Cognitive Science Society, Austin, TX, 2011.

[11] Renqiao Zhang, Jiajun Wu, Chengkai Zhang, William T Freeman, and Joshua B Tenenbaum. A comparative evaluation of approximate probabilistic simulation and deep neural networks as accounts of human physical scene understanding. arXiv preprint arXiv:1605.01138, 2016.

[12] Herbert Gelernter. Realization of a geometry theorem proving machine. In IFIP Congress, pages 273–281, 1959.

[13] Chris Alvin, Sumit Gulwani, Rupak Majumdar, and Supratik Mukhopadhyay. Synthesis of geometry proof problems. In AAAI, pages 245–252, 2014.

[14] Thomas G Evans. A heuristic program to solve geometric-analogy problems. In Proceedings of the April 21-23, 1964, spring joint computer conference, pages 327–338. ACM, 1964.

[15] Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi, Oren Etzioni, and Clint Malcolm. Solving geometry problems: Combining text and diagram interpretation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP, pages 17–21, 2015.

[16] Hyeonjun Park, Ji-Hun Bae, Jae-Han Park, Moon-Hong Baeg, and Jaeheung Park. Intuitive peg-in-hole assembly strategy with a compliant manipulator. In Robotics (ISR), 2013 44th International Symposium on, pages 1–5. IEEE, 2013.

[17] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.

[18] Lerrel Pinto, Dhiraj Gandhi, Yuanfeng Han, Yong-Lae Park, and Abhinav Gupta. The curious robot: Learning visual representations via physical interactions. arXiv preprint arXiv:1604.01360, 2016.

[19] Abhishek Das, Harsh Agrawal, C Lawrence Zitnick, Devi Parikh, and Dhruv Batra. Human attention in visual question answering: Do humans and deep networks look at the same regions? arXiv preprint arXiv:1606.03556, 2016.

[20] Joshua C Peterson, Joshua T Abbott, and Thomas L Griffiths. Adapting deep network features to capture psychological representations. arXiv preprint arXiv:1608.02164, 2016.

[21] Andrew M Saxe, James L McClelland, and Surya Ganguli. Learning hierarchical category structure in deep neural networks. In Proceedings of the 35th annual meeting of the Cognitive Science Society, pages 1271–1276, 2013.

[22] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.

[23] James L McClelland, Bruce L McNaughton, and Randall C O’Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review, 102(3):419, 1995.

[24] Ashvin Shah. Psychological and neuroscientific connections with reinforcement learning. In Reinforcement Learning, pages 507–537. Springer, 2012.

[25] Ziyu Wang, Nando de Freitas, and Marc Lanctot. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.

[26] DevSisters corp. Dqn-tensorflow. https://github.com/devsisters/DQN-tensorflow, 2016.

[27] Thomas Naselaris, Kendrick N Kay, Shinji Nishimoto, and Jack L Gallant. Encoding and decoding in fMRI. Neuroimage, 56(2):400–410, 2011.

[28] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.

[29] Andrew Saxe, Pang W Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, and Andrew Y Ng. On random weights and unsupervised feature learning. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 1089–1096, 2011.

[30] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[31] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

[32] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[33] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pages 1109–1117, 2016.

[34] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
