Going Deeper: Autonomous Steering with Neural Memory Networks
Tharindu Fernando Simon Denman Sridha Sridharan Clinton Fookes
{t.warnakulasuriya, s.denman, s.sridharan, c.fookes}@qut.edu.au
Image and Video Research Laboratory,
Queensland University of Technology,
Australia.
Abstract
Although autonomous driving is an area which has been
extensively explored in computer vision, current deep learn-
ing based methods such as direct image to action mapping
approaches are not able to generate accurate results, mak-
ing their application questionable. This is largely due to
the lack of capacity of the current state-of-the-art architec-
tures to capture long term dependencies which can model
different human preferences and their behaviour under dif-
ferent contexts. Our work explores a new paradigm in deep
autonomous driving where the model incorporates both vi-
sual input as well as the steering wheel trajectory and at-
tains a long term planning capacity via neural memory net-
works. Furthermore, this work investigates optimal feature
fusion techniques to combine these multimodal information
sources, without discarding the vital information that they
offer. The effectiveness of the proposed architecture is il-
lustrated using two publicly available datasets where in
both cases the proposed model demonstrates human like be-
haviour under challenging situations including illumination
variations, discontinuous shoulder lines, lane merges, and
divided highways, outperforming the current state-of-the-
art.
1. Introduction
Autonomous driving has been a popular topic among re-
searchers, receiving attention from both academic groups
and commercial projects such as Google self-driving cars,
Uber and Tesla.
Even though video cameras offer cheap solutions for the
data capture needs of these systems, current state-of-the-
art methods use high cost devices such as laser sensors and
radars in addition to vision sensors rather than completely
relying on video input. We believe current deep learning
based computer vision approaches [3,19,24] will ultimately
have the power necessary for safe open road driving using
vision based sensors alone.
Our work draws inspiration from recent seminal work
in [24] in which the authors model the autonomous driv-
ing problem as predicting future motion given the present
observation and previous history of the agent. This work,
with the aid of historical states of the agent, achieves com-
mendable results in contrast to popular behaviour reflex ap-
proaches [2, 11, 20–22] that directly map an input image to
a driving action.
Even though this shallow memory architecture is capable of modelling short term dependencies and achieves a significant improvement, it fails to capture long term dependencies due to an inherent problem with the sequential LSTM architecture: the hidden state activations are dominated by the most recent inputs [6, 9]. As more and more observations arrive, the most recent inputs dominate the memory, completely discarding vital factors from the long term history. Therefore, the current state-of-the-
art method of [24], which uses sequential LSTMs to esti-
mate the behaviour, fails to model important factors such
as road and weather conditions, or traffic conditions and
density which are vital for long term planning, due to the
underlying limitations of the LSTM representation.
The proposed approach is shown in Fig. 1 (a) and in
Fig. 1 (b) we compare it to the method proposed in [24].
We adopt recent advances in tree memory networks [9] to
develop the ability to capture both long-term and short-term
temporal dependencies which is crucial to effectively model
driving behaviour [3]. Both models receive a sequence of
video frames as the spatial input. In [24] the authors model past ground truth sensor information for speed and angular velocity as a trajectory, whereas in our model we utilise the steering wheel angle trajectory. The video frames are then passed through a Convolutional Neural Network (CNN)
to encode the visual input. The significant differences be-
tween the two approaches arise when considering the pro-
cess of modelling historical states. In the method proposed
in [24], the authors directly merge the two input sequences
and pass the merged feature through a shallow Long Short-Term Memory (LSTM) [12] layer; whereas in the proposed method we utilise separate LSTM models for the two input
streams and map their evolution via two separate external
memories, before merging current LSTM outputs together
with the memory outputs in order to generate the final out-
put.
This work offers several novel contributions. First, we
take inspiration from recent works in [24] and [9] to build a
new approach to capture dependencies between dense spa-
tial inputs together with a sparse steering wheel angle tra-
jectory. Second, we propose and evaluate three fusion meth-
ods to fuse spatial and trajectory memory states with the
present input. Finally, we report experimental results of the
proposed model on two popular publicly available driving
datasets and compare to various baselines, where in both
cases the proposed architecture is shown to outperform the
current state-of-the-art.
2. Related work
2.1. Autonomous driving
From the first attempts by Pomerleau [21] to use a
neural network for autonomous driving, which used a shal-
low network to directly map pixel values to simple driv-
ing commands, there has been a growing interest among re-
searchers in achieving fully autonomous vehicle navigation
in complex real world scenarios.
Approaches related to autonomous driving can be
broadly categorised into mediated approaches and be-
haviour reflex approaches. In mediated approaches [1,8,10,
16], the authors map pixels to pre-defined affordance mea-
sures, such as lanes, pedestrians, traffic lights and surround-
ing cars. For example in [8,16] the authors generate bound-
ing boxes on detected cars and in [1] splines for detected
lane markings. In order to utilise such detections for naviga-
tion in [1,8,16] the authors generate a layout of the intersec-
tion and traffic details by passing those various detections
through a probabilistic model. Even though this approach
generates interpretable results, evaluating such a complete
set of measures in real world conditions may be compu-
tationally expensive and unmanageable. Furthermore, in
real world conditions with missing lane markings, discon-
tinuous shoulder lines, varying illuminations, and cluttered
backgrounds, these approaches tend to produce false detec-
tions, which makes their applicability limited [3].
The second class of techniques, behaviour reflex ap-
proaches [2, 11, 20–22], are motivated by human behaviour
and construct a direct mapping from an image to a steer-
ing angle. Although this idea is straightforward, it has not
been capable of generating accurate results for several rea-
sons [3]. Firstly, as humans we have our own preferences
and this may lead to different styles of driving. Hence under
similar contexts human drivers may make completely dif-
ferent decisions. One driver may give way to the merging
traffic whereas another may not. Secondly, blind pixel to
action mappings cannot perform long term planning. Even
though it can generate reactive behaviours to avoid colli-
sions, such low level modelling fails to capture the underly-
ing semantics of driving.
Recently, with the advances in recurrent neural net-
works, an extension to the behaviour reflex approach, called
privileged training, is presented [24]. The authors investi-
gate the use of deep visual features extracted via a convo-
lutional neural network (CNN) [4] model and then utilise an LSTM for sequence modelling. Even though it has been
able to achieve encouraging results when compared with
behaviour reflex approaches, as shown in [5], such a naive
CNN to LSTM mapping fails to capture vital dependencies
among the input sequences. Furthermore such direct merg-
ing of a sparse trajectory input with dense visual features
(see Fig. 1 (b)) may lead to a blind mapping of features
without fully determining the degree of attention that each
feature stream requires.
2.2. Memory architectures
Along with deep learning models such as Recurrent Neu-
ral Networks (RNN) and LSTMs, a vast number of ap-
proaches [13, 15, 17, 23] have been proposed that utilise
memory architectures in order to better predict the variable
of interest. The memory stores important facts from histor-
ical inputs in order to capture temporal dependencies. For
instance, in the area of natural language processing Weston
et al. [23] have used a memory module to capture the se-
mantics within different languages. In similar works, in [13]
and in [14], the “memory module” has been utilised in or-
der to capture the coherence among images and their cap-
tions. The experimental results presented in [9] show that
such naive memory architectures [15,23] capture only short
term dependencies, instead of capturing long term depen-
dencies. But capturing both long term and short term de-
pendencies within the memory is extremely important for
long term planning. For instance by having a long term
memory we can capture how a particular driver behaves in
different contexts, adapts to different weather conditions,
or traffic congestion; allowing the memory to describe his
or her driving style, and utilise that information for future
planning. As a consequence of this need to capture both
long and short term dependencies, a memory architecture
called Tree Memory Networks [9] is proposed, which struc-
tures the memory as a hierarchical recursive tree structure,
instead of a flat, sequential layer of LSTM cells; and this
approach is shown to outperform sequential memory archi-
tectures in path prediction tasks.
[Figure 1 diagram: (a) Proposed Method: spatial input x^sp_t -> CNN -> spatial LSTM (c^sp_t) -> spatial memory module M^sp_t = S-LSTM(c^sp_t, M^sp_{t-1}); trajectory input x^tr_t -> trajectory LSTM (c^tr_t) -> trajectory memory module M^tr_t = S-LSTM(c^tr_t, M^tr_{t-1}); the memory reads m_t and current states c_t are fused into c*_t to produce the output y. (b) Method proposed in [24]: input x_t and trajectory of speed and course -> CNN -> LSTM -> y.]
Figure 1. Comparison between the proposed method and the current state-of-the-art approach proposed in [24]. In both methods the visual
input is encoded with a CNN. The authors in [24] directly merge the CNN output with the speed and angular velocity trajectory and pass
it through an LSTM layer. In contrast, in the proposed method, the encoded visual input and the steering wheel angle trajectory are passed
through separate LSTM models and their long term history is mapped via two separate external memories. The final output is generated
by merging current LSTM outputs together with the memory outputs.
[Figure 2 diagram: VGG-16 stack from the input through conv3-64 (x2), maxpool, conv3-128 (x2), maxpool, conv3-256 (x3), maxpool, conv3-512 (x3), maxpool, conv3-512 (x3), maxpool, FC-4096, FC-4096, FC-1000, soft-max; X_sp is taken from the first FC-4096 layer.]
Figure 2. VGG-16 model [4] with 13 convolutional (Conv3-
number-of-kernels) layers followed by 3 fully connected (FC-
number-of-hidden-units) layers. We extract activations from the
first fully connected layer (denoted Xsp).
3. Method
3.1. Spatial and trajectory input modules
3.1.1 Visual encoder
Each video frame in the database is encoded using a visual encoder which captures the spatial information in a discriminative manner. In the proposed model we utilise VGG-Net [4] pre-trained on ImageNet. In order to capture discriminative driving actions more precisely, we fine-tune the network with the frames in our datasets. Finally, as shown in Fig. 2, the first fully connected layer activations are extracted as inputs to the spatial LSTM.
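To make this step concrete, the following is a minimal PyTorch sketch of extracting the first fully connected layer activations from a torchvision VGG-16; the layer indexing and the assumption that fine-tuning has already been performed are illustrative, not a description of the released implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

# VGG-16 pre-trained on ImageNet; fine-tuning on the driving frames is assumed
# to have been done beforehand and is omitted here.
vgg = models.vgg16(pretrained=True)

# The first fully connected layer (FC-4096 in Fig. 2) is vgg.classifier[0];
# everything up to and including it forms the visual encoder.
feature_encoder = nn.Sequential(
    vgg.features,          # 13 convolutional layers with max-pooling
    nn.Flatten(),          # (512, 7, 7) -> 25088 for 224 x 224 inputs
    vgg.classifier[0],     # FC-4096: activations used as the spatial input x^sp
)

frames = torch.randn(36, 3, 224, 224)   # one 36-frame sub-sequence
with torch.no_grad():
    x_sp = feature_encoder(frames)      # (36, 4096) inputs for the spatial LSTM
```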
Let $X^{sp}_i = [x^{sp}_{i,1}, x^{sp}_{i,2}, \ldots, x^{sp}_{i,S}]$ be the $S$-length sequence of the first fully connected layer activations from the VGG network for the $i$th example. The spatial input module computes a vector representation for the input sequence via an LSTM layer,
$c^{sp}_t = f_{LSTM}(x^{sp}_t, c^{sp}_{t-1})$,    (1)
where $c^{sp}_t \in \mathbb{R}^{k^{sp}}$, and $k^{sp}$ is the embedding dimension of the LSTM.
3.1.2 Trajectory LSTM
Let $X^{tr}_i = [x^{tr}_{i,1}, x^{tr}_{i,2}, \ldots, x^{tr}_{i,S}]$ be the $S$-length sequence of steering angles for the $i$th example. The trajectory input module computes a vector representation for the input sequence via an LSTM layer,
$c^{tr}_t = f_{LSTM}(x^{tr}_t, c^{tr}_{t-1})$,    (2)
where $c^{tr}_t \in \mathbb{R}^{k^{tr}}$, and $k^{tr}$ is the embedding dimension of the LSTM.
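Both input modules are standard LSTM encoders. A minimal sketch is given below; the dimensions (4096-d VGG activations, a scalar steering angle, 300-d embeddings) follow values stated elsewhere in the paper, while the class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class StreamEncoder(nn.Module):
    """LSTM encoder producing the embedding c_t at every time step (Eqs. 1 and 2)."""
    def __init__(self, input_dim, embed_dim):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, embed_dim, batch_first=True)

    def forward(self, x):            # x: (batch, S, input_dim)
        outputs, _ = self.lstm(x)    # outputs: (batch, S, embed_dim)
        return outputs               # c_t for t = 1 .. S

spatial_encoder = StreamEncoder(input_dim=4096, embed_dim=300)    # produces c^sp_t
trajectory_encoder = StreamEncoder(input_dim=1, embed_dim=300)    # produces c^tr_t

x_sp = torch.randn(8, 36, 4096)   # batch of VGG activation sequences
x_tr = torch.randn(8, 36, 1)      # batch of steering angle sequences
c_sp, c_tr = spatial_encoder(x_sp), trajectory_encoder(x_tr)
```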
3.2. Spatial and trajectory memory modules
In order to map the coherence between spatial sequences
we utilise a spatial memory module. Consider N to be a
queued sequence of historical LSTM embeddings for the
spatial input, with length p and embedding dimension k ($N \in \mathbb{R}^{p \times k}$). Based on the exemplary results obtained by
Fernando et al. [9] for long term dependency modelling, we
adapt the same S-LSTM memory model as proposed in [9].
When computing an output at time instance $t$ we extract the tree configuration at time instance $t-1$. Let $M^{sp}_{t-1} \in \mathbb{R}^{k \times (2^{l}-1)}$ be the memory matrix resulting from concatenating the nodes of the tree from the root down to depth $l$. The motivation behind using multiple nodes instead
of a single node is to capture the different levels of abstrac-
tion that exist in the memory network. When considering
dense inputs such as fully connected layer activations from
a CNN, this can be extremely useful to increase the capacity
of the model.
[Figure 3 diagram: memory cells and fused output m_t for (a) Individual attention based fusion with LSTM memory cells (attention weights α^sp_1..3, α^tr_1..3); (b) Shared attention based fusion (shared weights α_1..3); (c) Hierarchical fusion.]
Figure 3. Comparison between different fusion methods. The ex-
tracted memory cells of both the spatial memory module and the
trajectory memory module are shown on the left. The fusion meth-
ods are shown on the right. Memory cells at different levels are
shown in different colours.
The temporal dependencies between the steering angles
can be mapped in a similar way, and as such a second mem-
ory model is used to model these, with data extracted in
the same manner. Following the method presented in [9],
at each time step we update the content of both spatial and
temporal memory units.
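A simplified sketch of the memory read described above is shown below. It assumes the tree is held as a list of per-level node states; the S-LSTM update that maintains those states follows [9] and is not reproduced here.

```python
import torch

def read_memory_matrix(tree_levels, l):
    """Form M_{t-1} by concatenating the node states of the top l levels of the
    memory tree (2^l - 1 nodes for a binary tree, root at depth 0).

    tree_levels[d] is a (2^d, k) tensor of node states at depth d.
    """
    nodes = [tree_levels[d] for d in range(l)]   # levels 0 .. l-1
    M = torch.cat(nodes, dim=0)                  # (2^l - 1, k)
    return M.t()                                 # (k, 2^l - 1), as used in the text

# Toy example: k = 300, read the top 3 levels (7 nodes).
tree = [torch.randn(2 ** d, 300) for d in range(5)]
M_sp = read_memory_matrix(tree, l=3)             # shape (300, 7)
```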
3.3. Feature fusion
After feeding data to the memory modules, we have four
data sources: the LSTM encodings for the two inputs (spa-
tial and steering angle), and the memory outputs for these
same two data sources. The structure that we utilise in or-
der to incorporate information from these different sources
is a vital factor for the modelling process. Even though the spatial and trajectory memories have the ability to capture the long term behaviour of human drivers, if the fusion approach does not consider which aspects of the memory components are salient and how useful they are under a given context, then the long term planning process will be ineffective.
Therefore, in order to fuse these information sources we consider three approaches: i) a naive approach that treats each modality separately (see Section 3.3.1); ii) a joint approach that shares the attention between the two modalities (see Section 3.3.2); and iii) a hierarchical approach that enables deep layer wise fusion of the two memory outputs (see Section 3.3.3). Fig. 3 illustrates these different feature fusion
architectures which are further described in the following
sub-sections.
3.3.1 Individual attention fusion (fu-ia)
As shown in Fig. 3 (a), we have simply flattened the ex-
tracted memory cells and generated individual attention val-
ues for each memory cell in the following manner. Let
fscore be an attention scoring function which can be im-
plemented as a multi-layer perceptron [18],
$m^{sp}_t = f_{score}(M^{sp}_{t-1}, c^{sp}_t)$,    (3)
$\alpha^{sp} = \mathrm{softmax}(m^{sp}_t)$.    (4)
The most relevant memory items for the current input are extracted via an attention mechanism derived using Eq. 3 and Eq. 4,
$z^{sp}_t = M^{sp}_{t-1}[\alpha^{sp}]^{T}$.    (5)
Similarly, the attention of the trajectory memory module can be evaluated as,
$m^{tr}_t = f_{score}(M^{tr}_{t-1}, c^{tr}_t)$,    (6)
$\alpha^{tr} = \mathrm{softmax}(m^{tr}_t)$,    (7)
$z^{tr}_t = M^{tr}_{t-1}[\alpha^{tr}]^{T}$.    (8)
Then the final output $y_t$, which is the steering angle in degrees normalised between 0 and 1, can be represented as,
$y_t = \mathrm{ReLU}(W^{sp} z^{sp}_t + (1 - W^{sp}) c^{sp}_t + W^{tr} z^{tr}_t + (1 - W^{tr}) c^{tr}_t)$,    (9)
where $W^{sp}$ and $W^{tr}$ are the respective output weights, learned during training.
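A minimal sketch of the fu-ia fusion is given below. It assumes f_score is a single linear scoring layer applied to each (memory slot, current embedding) pair, and reads W^sp and W^tr in Eq. 9 as learned weight vectors acting element-wise; this is one plausible reading of the equations rather than the exact implementation.

```python
import torch
import torch.nn as nn

class IndividualAttentionFusion(nn.Module):
    """Sketch of fu-ia (Eqs. 3-9): each stream scores its own memory slots
    against its current LSTM embedding and blends the attended read with it."""
    def __init__(self, k):
        super().__init__()
        self.score_sp = nn.Linear(2 * k, 1)       # f_score for the spatial stream
        self.score_tr = nn.Linear(2 * k, 1)       # f_score for the trajectory stream
        self.w_sp = nn.Parameter(torch.rand(k))   # W^sp
        self.w_tr = nn.Parameter(torch.rand(k))   # W^tr

    def _attend(self, M, c, score_fn):
        # M: (k, n) memory matrix; c: (k,) current embedding (Eqs. 3-5 / 6-8)
        n = M.shape[1]
        pairs = torch.cat([M.t(), c.expand(n, -1)], dim=1)          # (n, 2k)
        alpha = torch.softmax(score_fn(pairs).squeeze(-1), dim=0)   # attention weights
        return M @ alpha                                            # attended read z_t

    def forward(self, M_sp, c_sp, M_tr, c_tr):
        z_sp = self._attend(M_sp, c_sp, self.score_sp)
        z_tr = self._attend(M_tr, c_tr, self.score_tr)
        # Eq. 9: blend the attended reads with the current embeddings.
        return torch.relu((self.w_sp * z_sp + (1 - self.w_sp) * c_sp
                           + self.w_tr * z_tr + (1 - self.w_tr) * c_tr).sum())

fusion = IndividualAttentionFusion(k=300)
y = fusion(torch.randn(300, 7), torch.randn(300), torch.randn(300, 7), torch.randn(300))
```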
3.3.2 Shared attention fusion (fu-sa)
In this fusion method we have used a shared attention for
both memory modules. Fig. 3 (b) depicts the idea. Let
$f_{score_s}$ be an attention scoring function which accepts both memory inputs and current contexts and generates attention weights similar to the pattern shown in the figure,
$m_t = f_{score_s}(M^{sp}_{t-1}, M^{tr}_{t-1}, c^{sp}_t, c^{tr}_t)$,    (10)
$\alpha = \mathrm{softmax}(m_t)$,    (11)
$z^{sp}_t = M^{sp}_{t-1}[\alpha]^{T}$,    (12)
$z^{tr}_t = M^{tr}_{t-1}[\alpha]^{T}$.    (13)
Then the final output can be represented as,
$y_t = \mathrm{ReLU}(W^{sp} z^{sp}_t + (1 - W^{sp}) c^{sp}_t + W^{tr} z^{tr}_t + (1 - W^{tr}) c^{tr}_t)$,    (14)
where $W^{sp}$ and $W^{tr}$ are the respective learned output weights.
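Under the same assumptions, a sketch of the shared attention variant differs only in that one scoring function sees both memories and both current contexts and yields a single set of weights applied to both memory reads.

```python
import torch
import torch.nn as nn

class SharedAttentionFusion(nn.Module):
    """Sketch of fu-sa (Eqs. 10-14), assuming both memories hold the same
    number of slots so that one attention vector can index both."""
    def __init__(self, k):
        super().__init__()
        self.score = nn.Linear(4 * k, 1)          # f_score_s over [M^sp, M^tr, c^sp, c^tr]
        self.w_sp = nn.Parameter(torch.rand(k))
        self.w_tr = nn.Parameter(torch.rand(k))

    def forward(self, M_sp, c_sp, M_tr, c_tr):
        n = M_sp.shape[1]
        pairs = torch.cat([M_sp.t(), M_tr.t(),
                           c_sp.expand(n, -1), c_tr.expand(n, -1)], dim=1)  # (n, 4k)
        alpha = torch.softmax(self.score(pairs).squeeze(-1), dim=0)         # shared weights
        z_sp, z_tr = M_sp @ alpha, M_tr @ alpha                             # Eqs. 12-13
        return torch.relu((self.w_sp * z_sp + (1 - self.w_sp) * c_sp
                           + self.w_tr * z_tr + (1 - self.w_tr) * c_tr).sum())
```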
3.3.3 Hierarchical fusion (fu-h)
In this method we map the extracted memory outputs hier-
archically using an S-LSTM module (See Fig. 3 (c)). The
motivation behind the hierarchical fusion mechanism is to
retain the salient features from the two memory networks by
representing them with different levels of abstraction. For
example, the work by Zhu et al. [26] has shown that a recursive LSTM model is capable of achieving state-of-the-art
performance on semantic composition tasks, outperforming
flat, sequential LSTM structures. By combining the acti-
vations from spatial and trajectory memories hierarchically,
we aim to capture how human drivers adapt to different con-
ditions in a logical and orderly fashion.
Furthermore, in [5], the authors investigate the usage of
different information sources together, and their evaluations
revealed that hierarchically modelling features using an S-
LSTM module results in more accurate predictions. This
is further evidence to verify that different feature streams
should not be combined directly. Instead, they should be
merged in a hierarchical form as these different streams may
complement each other, providing vital information under
different contexts, forming a stronger overall system.
Following this reasoning, we extract a fused feature as fol-
lows,
$c^{*}_t = f_{S\text{-}LSTM}(M^{sp}_{t-1}, M^{tr}_{t-1})$.    (15)
Then the final output can be represented as,
$y_t = \mathrm{ReLU}(W^{c} c^{*}_t + W^{sp} c^{sp}_t + W^{tr} c^{tr}_t)$,    (16)
where $W^{c}$, $W^{sp}$ and $W^{tr}$ are the respective learned output weights.
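The sketch below conveys the level-by-level merge of Eq. 15 and the output of Eq. 16 with a simple pairwise combiner standing in for the S-LSTM of [9]; it is an illustration of the hierarchical fusion idea, not the exact model.

```python
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    """Sketch of fu-h: memory slots from both streams are merged pairwise,
    level by level, into a single fused vector c*_t (simplified Eq. 15)."""
    def __init__(self, k, out_dim=1):
        super().__init__()
        self.combine = nn.Linear(2 * k, k)             # merges two child nodes
        self.W_c = nn.Linear(k, out_dim, bias=False)
        self.W_sp = nn.Linear(k, out_dim, bias=False)
        self.W_tr = nn.Linear(k, out_dim, bias=False)

    def forward(self, M_sp, M_tr, c_sp, c_tr):
        nodes = list(torch.cat([M_sp, M_tr], dim=1).t())   # leaf nodes, each (k,)
        while len(nodes) > 1:                              # merge upwards, level by level
            if len(nodes) % 2:                             # repeat the last node on odd levels
                nodes.append(nodes[-1])
            nodes = [torch.tanh(self.combine(torch.cat([a, b])))
                     for a, b in zip(nodes[0::2], nodes[1::2])]
        c_star = nodes[0]                                  # fused representation c*_t
        # Eq. 16: combine c*_t with the current LSTM embeddings.
        return torch.relu(self.W_c(c_star) + self.W_sp(c_sp) + self.W_tr(c_tr))
```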
4. Experimental results
4.1. Datasets
4.1.1 Comma.ai dataset
The comma.ai dataset [2] is a publicly available driving
dataset with 7.25 hours of driving data. The video data is
recorded from a dashboard camera at 20 FPS. The dataset
also has several sensor recordings (i.e. car speed, steering
angle, GPS, gyroscope) that are measured at different fre-
quencies and synchronised to the same sampling rate. As
a preprocessing step we extract non-overlapping video sub-
sequences 36 frames in length. From the resultant video
sequences along with corresponding steering wheel angles,
the first 20,883 examples are chosen for training and the re-
maining 8,951 for testing. Video frames are down-sampled
to 224 × 224 for processing by the VGG-16 network.
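One possible form of this preprocessing, assuming the frames and steering angles have already been loaded and synchronised as arrays (the comma.ai file layout itself is not shown), is sketched below.

```python
import numpy as np
import cv2  # OpenCV, used here only for frame resizing

def make_subsequences(frames, angles, seq_len=36, size=(224, 224)):
    """Split synchronised frames and steering angles into non-overlapping
    fixed-length sub-sequences, resizing frames for the VGG-16 encoder.

    frames: (N, H, W, 3) uint8 array; angles: (N,) steering angles in degrees.
    """
    n_seqs = len(frames) // seq_len
    clips, labels = [], []
    for s in range(n_seqs):
        chunk = frames[s * seq_len:(s + 1) * seq_len]
        clips.append(np.stack([cv2.resize(f, size) for f in chunk]))
        labels.append(angles[s * seq_len:(s + 1) * seq_len])
    return np.stack(clips), np.stack(labels)   # (n_seqs, 36, 224, 224, 3), (n_seqs, 36)
```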
4.1.2 Udacity’s self-driving car dataset
The Udacity self-driving car data set [11] contains video
frames from three front-facing cameras (left, centre, and
right) and vehicle measurements such as speed and steer-
ing wheel angle, recorded from the vehicle driving on the
road. As the video and vehicle measurements are sampled
at different sampling rates a synchronisation process is re-
quired. Therefore we perform the following pre-processing operations: i) synchronising video frames with measurements from the vehicle; ii) selecting only centre camera frames; iii) down-sampling the video frames to 224 × 224; and iv) extracting video sequences of 36 frames in length. The re-
sultant dataset contains 77,308 samples with corresponding
steering wheel angles. We selected the first 54,115 samples
for training and the remaining 23,193 examples are used for
testing.
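The synchronisation step can be sketched as a nearest-timestamp join; the column names below are assumptions about how the Udacity logs might be loaded, not the dataset's actual schema.

```python
import pandas as pd

def sync_frames_to_measurements(frame_index, measurements):
    """Attach the nearest steering measurement to every centre-camera frame.

    frame_index: DataFrame with a 'timestamp' column (one row per frame).
    measurements: DataFrame with 'timestamp' and 'steering_angle' columns,
    recorded at a different sampling rate than the video.
    """
    frames = frame_index.sort_values('timestamp')
    meas = measurements.sort_values('timestamp')
    return pd.merge_asof(frames, meas, on='timestamp', direction='nearest')
```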
4.2. Quantitative evaluation
The VGG-16 feature extractor was pre-trained with an
analog output unit to learn the recorded steering angle from
randomly selected single frame images from Udacity’s self-
driving car dataset. After that offline training process the
proposed network is added and we use stochastic gradient
descent (SGD) with momentum of 0.99 and a batch size of
100. Models are evaluated using root mean square error
(RMSE).
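A sketch of the optimisation and evaluation setup consistent with this description is given below; the momentum (0.99) and batch size (100) follow the text, the learning rate is not reported and is a placeholder, and the model is a stand-in for the full encoder, memory and fusion network.

```python
import torch
import torch.nn as nn

model = nn.Linear(300, 1)   # placeholder for the full encoder + memory + fusion network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.99)  # lr is a placeholder
criterion = nn.MSELoss()

def rmse_degrees(pred, target):
    """Root mean square error of the predicted steering angle, in degrees."""
    return torch.sqrt(torch.mean((pred - target) ** 2))

# Training loop outline (batches of 100 sequences):
# for x_sp, x_tr, y_true in train_loader:
#     y_pred = model(...)                     # forward pass through the full network
#     loss = criterion(y_pred, y_true)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```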
The hyperparameters of the two memory modules, namely the length of the memory module, p, the embedding dimension, k, and the depth of the extracted memory matrix, l, are evaluated experimentally with reference to the hierarchical fusion model (fu-h). Fig. 4 (a) shows the variation in RMSE
against p for the spatial and trajectory memory modules in
solid red and dashed green lines respectively. For the spa-
tial memory, the error converges around p = 250, and for
trajectory memory, the error converges around p = 310.
Therefore we set the value of p as 256. Fig. 4 (b) shows the
variation of average RMSE against k for the two memory
modules. As the error converges around 300 hidden units,
the embedding dimension is set to 300. With the aid of
similar experiments we evaluated the depth of memory read
location, l, and the resultant error plots are shown in Fig.
4 (c) where we are able to conclude that l = 20 produces
optimal results.
The experimental evaluations are tabulated in Tab. 1.
The Comma.ai dataset does not provide standard training and testing splits. Furthermore, for Udacity's self-driving car dataset the ground truth labels for the testing split are not available. Therefore we evaluated all the baselines on our own training-testing splits; for the Udacity dataset, we divided the provided training set into training and testing splits. As the base-
line models we use the deep architectures proposed in [24]
and [20]. As a Behavioural Reflex method we utilise [20]
and for the Mediated Perception Approach, as given in [24],
we first compute the segmentation output of every frame
in Comma.ai and Udacity’s self-driving car datasets using
the Multi-Scale Context Aggregation approach described in [25].
[Figure 4 panels: (a) RMSE vs length of memory module p; (b) RMSE vs embedding dimension k; (c) RMSE vs depth of extracted matrix l. Axes show RMSE of prediction (degrees) for the spatial memory and the trajectory memory.]
Figure 4. Parameter evaluation: We evaluate the length of the memory module, p (see (a)), the embedding dimension, k (see (b)) and the
depth of the extracted matrix, l (see (c)).
Method                      RMSE (in degrees)
                            Comma.ai   Udacity
Behavioural Reflex [20]       25.6       31.8
Mediated Perception [24]      23.7       29.5
Privileged training [24]      23.1       27.3
fu-ia                         21.4       24.3
fu-sa                         22.5       24.8
fu-h                          20.1       22.5
Table 1. Comparison of our results to the state-of-the-art for autonomous driving on the Comma.ai and Udacity datasets.
For the segmentation mask we utilise the Cityscapes [7]
mask. Then mimicking the stage-by-stage training, we train
the LSTM independently from the segmentation process.
For the Privileged training approach we use the method pro-
posed in [24].
Results are presented in Tab. 1, and we observe that all of
our fusion methods outperform the current state-of-the-art.
The behavioural reflex model [20] has the lowest accuracy, verifying our assertion that such models do not possess enough capacity to perform long term planning. The privileged training model of [24] obtains better performance, but it still cannot completely model the driving style of the driver due to its shallow LSTM structure, which in turn leads to erroneous
predictions.
Our hierarchical fusion method (fu-h) has the lowest er-
ror illustrating that a deep layer wise fusion method is able
to capture different levels of abstraction and semantics that
are present in the different input modalities. Models fu-ia
and fu-sa do not possess such ability but still fu-ia outper-
forms fu-sa demonstrating that different input streams pos-
sess information which requires different degrees of atten-
tion. Therefore in fu-ia we are able to learn that attention
through back propagation where the model learns what acts
as the main information queue and what the complemen-
tary information is. Model fu-h takes this concept to its
next level where it offers layer wise merging of temporal
semantics from the two information streams.
4.3. Qualitative evaluation
Figure 5. Qualitative evaluation: Results from Comma.ai dataset
(a)-(c) and Udacity dataset (d)-(f), predictions of fu-h model in red,
behavioural reflex approach in yellow, mediated perception ap-
proach in green, privileged training approach in brown and ground
truths in blue.
In Fig. 5 we show prediction results from our hierarchi-
cal fusion model (fu-h model) in red, the behavioural re-
flex approach in yellow, the mediated perception approach
in green and the privileged training approach in brown for
Comma.ai and Udacity datasets. The blue arrow shows the
driver’s action. Lane change behaviour where the driver
moves from the left lane to the right lane is shown in Fig. 5
(a) and in sub-figures (b) and (c) we show the lane following
behaviour for left and right curved lanes. Challenges with
the Udacity dataset are shown in sub figures (d)-(f), where
the illumination conditions vary from normal to direct sun-
light and shadows within a few seconds. The results illustrate
that regardless of such changes, illumination variations, dis-
continuous shoulder lines, lane merges, and divided high-
ways, the proposed model is able to generate accurate pre-
dictions and outperform all the baselines. For instance in
Fig. 5 (a), due to the long term dependency modelling abil-
ity, the proposed model is able to anticipate the lane change
behaviour of the driver and generate accurate predictions.
In contrast, all the baselines have shown lane following be-
haviour.
Furthermore, as the illumination conditions fluctuate in
sub figures (d)-(f), the predictions of the behavioural reflex
approach become more erroneous as it directly maps pix-
els to an action. The mediated perception approach and
privileged training approach have been able to tolerate that
to a certain extent using the sequence modelling ability of
the LSTMs but still the predictions aren’t accurate when
compared to the predictions of the fu-h model. As the pro-
posed model has enough capacity to model how the human
drivers behave under different road conditions and illumi-
nation conditions, it has accurately anticipated the driver’s
behaviour.
4.4. Visualisation of memory activations
In Fig. 6 we visualise the temporal evolution of the two memory networks and the fusion methods. The trajectory of the steering wheel angle is illustrated in Fig. 6 (a). In subfigure (b) we visualise every 100th frame that is given to the model as the spatial input. We have randomly selected a hidden unit from the top most layer of the trajectory and spatial memories; subfigures (c)-(d) show activations from the respective memories, and subfigures (e)-(g) visualise the activations from the proposed fu-ia, fu-sa and fu-h fusion methods respectively.
From the illustrations presented in Fig. 6 it is evident that both spatial and trajectory memory activations possess vital information. For instance, between time steps 0.5 and 1.5 ×10^4 the spatial input (Fig. 6 (b)) becomes noisy with changes in lighting conditions and hence the memory activations show sudden fluctuations. In such scenarios, with the fu-sa fusion method, where we use the same attention weights for both the spatial and trajectory inputs, the model loses the chance to obtain complementary information that is available from the trajectory history. Therefore the small spikes of activation that are visible in the trajectory memory are not incorporated in the final memory output, and it is further evident that the fusion process is mainly driven by the spatial memory output. With the individual attention mechanism (fu-ia) we allow the model to learn the different degrees of attention that different input modalities should receive. The hierarchical structure of fu-h grants the opportunity for deep layer wise fusion of the salient features, allowing the model to pay careful attention towards different levels of abstraction. As the model gains enough capacity to jointly backpropagate and learn the instances in which it should vary its attention, we observe a unique distribution of activations in Fig. 6 (g).
Figure 6. Visualisation of the activations of the memory and fusion methods. The first two rows show the trajectory and spatial inputs at different time steps. Rows (c)-(d) show activations from the trajectory and spatial memories respectively. In rows (e)-(g) we visualise the activations from the fu-ia, fu-sa and fu-h fusion methods respectively.
Note that between time steps 0.5 and 1.5 ×10^4 there exist several peaks and valleys of activations in
Fig. 6 (g) that are not present in (e) and (f). This further
illustrates our motivation as the model has obtained spe-
cific information from both spatial and trajectory memories
demonstrating their importance.
5. Conclusion
This paper proposed a novel deep learning architecture
for autonomous driving. Instead of blind pixel to action
mapping or shallow planning, the proposed model incorpo-
rates both visual input as well as the steering wheel trajec-
tory and attains a long term planning capability via neural
Tree Memory Networks. Furthermore, we introduced three
fusion techniques to combine these multimodal information
sources enabling optimal utilisation of the vital information
that they provide. We tested our models on two challeng-
ing publicly available datasets and the experimental evalua-
tions demonstrate that the proposed architectures have out-
performed the current state-of-the-art and generate human
like driving behaviour with the aid of long term modelling.
References
[1] M. Aly. Real time detection of lane markers in urban streets.
In IEEE Intelligent Vehicles Symposium, pages 7–12. IEEE,
2008.
[2] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner,
B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller,
J. Zhang, et al. End to end learning for self-driving cars.
arXiv preprint arXiv:1604.07316, 2016.
[3] C. Chen, A. Seff, A. Kornhauser, and J. Xiao. Deepdriving:
Learning affordance for direct perception in autonomous
driving. In ICCV, pages 2722–2730, 2015.
[4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and
A. L. Yuille. Semantic image segmentation with deep con-
volutional nets and fully connected crfs. arXiv preprint
arXiv:1412.7062, 2014.
[5] Q. Chen, X. Zhu, Z. Ling, S. Wei, and H. Jiang. Enhancing
and combining sequential and tree lstm for natural language
inference. arXiv preprint arXiv:1609.06038, 2016.
[6] X. Chen and C. Lawrence Zitnick. Mind’s eye: A recurrent
visual representation for image caption generation. In CVPR,
pages 2422–2431, 2015.
[7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler,
R. Benenson, U. Franke, S. Roth, and B. Schiele. The
cityscapes dataset for semantic urban scene understanding.
In CVPR, pages 3213–3223, 2016.
[8] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ra-
manan. Object detection with discriminatively trained part-
based models. IEEE T-PAMI, 32(9):1627–1645, 2010.
[9] T. Fernando, S. Denman, A. McFadyen, S. Sridharan, and
C. Fookes. Tree memory networks for modelling long-term
temporal dependencies. arXiv preprint arXiv:1703.04706,
2017.
[10] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets
robotics: The kitti dataset. The International Journal of
Robotics Research, 32(11):1231–1237, 2013.
[11] J. F. Heyse and M. Bouton. End-to-end driving controls pre-
dictions from images. 2016.
[12] S. Hochreiter and J. Schmidhuber. Long short-term memory.
Neural computation, 9(8):1735–1780, 1997.
[13] A. Joulin and T. Mikolov. Inferring algorithmic patterns with
stack-augmented recurrent nets. In Advances in Neural In-
formation Processing Systems, pages 190–198, 2015.
[14] Ł. Kaiser and I. Sutskever. Neural gpus learn algorithms.
arXiv preprint arXiv:1511.08228, 2015.
[15] A. Kumar, O. Irsoy, J. Su, J. Bradbury, R. English, B. Pierce,
P. Ondruska, I. Gulrajani, and R. Socher. Ask me anything:
Dynamic memory networks for natural language processing.
arXiv preprint arXiv:1506.07285, 2015.
[16] P. Lenz, J. Ziegler, A. Geiger, and M. Roser. Sparse scene
flow segmentation for moving object detection in urban en-
vironments. In IEEE Intelligent Vehicles Symposium, pages
926–932. IEEE, 2011.
[17] M. Malinowski and M. Fritz. A multi-world approach to
question answering about real-world scenes based on uncer-
tain input. In Advances in Neural Information Processing
Systems, pages 1682–1690, 2014.
[18] T. Munkhdalai and H. Yu. Neural tree indexers for text un-
derstanding. arXiv preprint arXiv:1607.04492, 2016.
[19] D. Neil, J. H. Lee, T. Delbruck, and S.-C. Liu. Delta net-
works for optimized recurrent network computation. arXiv
preprint arXiv:1612.05571, 2016.
[20] P. Penkov, V. Sriram, and J. Ye. Applying techniques in
supervised deep learning to steering angle prediction in au-
tonomous vehicles. 2016.
[21] D. A. Pomerleau. Alvinn, an autonomous land vehicle in a
neural network. Technical report, Carnegie Mellon Univer-
sity, Computer Science Department, 1989.
[22] D. A. Pomerleau. Neural network perception for mobile
robot guidance, volume 239. Springer Science & Business
Media, 2012.
[23] J. Weston, S. Chopra, and A. Bordes. Memory networks.
arXiv preprint arXiv:1410.3916, 2014.
[24] H. Xu, Y. Gao, F. Yu, and T. Darrell. End-to-end learning
of driving models from large-scale video datasets. arXiv
preprint arXiv:1612.01079, 2016.
[25] F. Yu and V. Koltun. Multi-scale context aggregation by di-
lated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[26] X.-D. Zhu, P. Sobhani, and H. Guo. Long short-term mem-
ory over recursive structures. In ICML, pages 1604–1612,
2015.