
Visual Tracking via Dynamic Memory Networks

Tianyu Yang and Antoni B. Chan

Abstract—Template-matching methods for visual tracking have gained popularity recently due to their good performance and fast speed. However, they lack effective ways to adapt to changes in the target object's appearance, making their tracking accuracy still far from state-of-the-art. In this paper, we propose a dynamic memory network to adapt the template to the target's appearance variations during tracking. The reading and writing process of the external memory is controlled by an LSTM network with the search feature map as input. A spatial attention mechanism is applied to concentrate the LSTM input on the potential target, as the location of the target is at first unknown. To prevent aggressive model adaptivity, we apply gated residual template learning to control the amount of retrieved memory that is used to combine with the initial template. In order to alleviate the drift problem, we also design a “negative” memory unit that stores templates for distractors, which are used to cancel out wrong responses from the object template. To further boost the tracking performance, an auxiliary classification loss is added after the feature extractor part. Unlike tracking-by-detection methods, where the object's information is maintained by the weight parameters of neural networks, which requires expensive online fine-tuning to be adaptable, our tracker runs completely feed-forward and adapts to the target's appearance changes by updating the external memory. Moreover, the capacity of our model is not determined by the network size as with other trackers – the capacity can be easily enlarged as the memory requirements of a task increase, which is favorable for memorizing long-term object information. Extensive experiments on the OTB and VOT datasets demonstrate that our trackers perform favorably against state-of-the-art tracking methods while retaining real-time speed.

Index Terms—Dynamic Memory Networks, Spatial Attention, Gated Residual Template Learning, Distractor Template Canceling


1 INTRODUCTION

Recent years have witnessed the great success of convolutional neural networks (CNNs) applied to image recognition [1], [2], [3], object detection [4], [5], [6] and semantic segmentation [7], [8], [9]. The visual tracking community has also seen an increasing number of trackers [10], [11], [12], [13], [14] adopting deep learning models to boost their performance. Among them are two dominant tracking strategies. One is the tracking-by-detection scheme that online trains an object appearance classifier [10], [11] to distinguish the target from the background. The model is first learned using the initial frame, and then fine-tuned using the training samples generated in the subsequent frames based on the newly predicted bounding box. The other scheme is template matching, which adopts either the target patch in the first frame [13], [15] or the previous frame [16] to construct the matching model. To handle changes in the target appearance, the template built in the first frame may be interpolated with the recently generated object template using a small learning rate [17].

The main difference between these two strategies is that tracking-by-detection maintains the target's appearance information in the weights of the deep neural network, thus requiring online fine-tuning with stochastic gradient descent (SGD) to make the model adaptable, while in contrast, template matching stores the target's appearance in the object template, which is generated by feed-forward computations. Due to the computationally expensive model updating required in tracking-by-detection, the speed of such methods is usually slow, e.g., [10], [11], [18] run at about 1 fps, although they do achieve state-of-the-art tracking accuracy. Template matching methods, however, are fast because there is no need to update the parameters of the neural networks.

• The authors are with the Department of Computer Science, City University of Hong Kong. E-mail: [email protected], [email protected]

Recently, several trackers [13], [14], [19], [20], [21] adopt fully convolutional Siamese networks as the matching model, which demonstrate promising results and real-time speed. However, there is still a large performance gap between template-matching models and tracking-by-detection, due to the lack of an effective method for adapting to appearance variations online.

In this paper, we propose a dynamic memory network, where the target information is stored in and recalled from an external memory, to maintain the variations of object appearance for template matching (see an example in Figure 1). Unlike tracking-by-detection, where the target's information is stored in the weights of neural networks and therefore the capacity of the model is fixed by the number of parameters, the model capacity of our memory networks can be easily enlarged by increasing the size of the external memory, which is useful for memorizing long-term appearance variations. Since aggressive template updating is prone to overfitting recent frames and the initial template is the most reliable one, we use the initial template as a conservative reference of the object and a residual template, obtained from retrieved memory, to adapt to the appearance variations. During tracking, the residual template is gated channel-wise and combined with the initial template to form the positive matching template. The channel-wise gating of the residual template controls how much each channel of the retrieved template should be added to the initial template, which can be interpreted as a feature/part selector for adapting the template. Besides the positive template, a second “negative” memory unit stores templates of potential distractor objects. The negative template is used to cancel out non-discriminative channels (corresponding to object parts) in the positive template, yielding the final template, which is convolved with the search image features to get the response map. The reading and writing process of the positive and negative memories, as well as the channel-wise gate vector for the residual template, is controlled by an LSTM (Long Short-Term Memory) whose input is based on the search feature map.


Fig. 1. Example of template updating on the Basketball video: plot of the control gate signals (skip gate $g_s$, read gate $g_r$, allocation gate $g_a$) over the frame number, with frames #1, #90, #270, #460, #600 and #690 shown above. The control gate signals change along with the appearance variations. When there are large appearance changes, the allocation gate approaches 1, which means a new memory slot is overwritten. When there are only small appearance variations in the object template, the read gate is close to 1, which indicates that the most recently read memory slot will be updated. See Section 3.7 for detailed explanations.

As the target position is at first unknown in the search image, we adopt an attention mechanism to roughly locate the object in the search image, thus leading to a soft representation of the target for the input to the LSTM controller. This helps to retrieve the most related template in the memory. In addition, we further improve the tracking performance by adding an auxiliary classification loss at the end of the CNN feature extractor, which is aimed at improving the tracker's robustness to appearance variations. The tracking and classification losses serve complementary roles – learning features through similarity matching facilitates precise localization, while training features on the auxiliary classification problem provides semantic information for tracking robustness. The whole framework is differentiable and therefore can be trained end-to-end with SGD. In summary, the contributions of our work are:

• We design a dynamic memory network for visual tracking. An external memory block, which is controlled by an LSTM with an attention mechanism, allows adaptation to appearance variations.

• We propose gated residual template learning to generate the final matching template, which effectively controls the amount of appearance variations in retrieved memory that is added to each channel of the initial matching template. This prevents excessive model updating, while retaining the conservative information of the target.

• We propose a negative template memory for storing and retrieving distractor templates, which are used to cancel the response peaks due to distractor objects, thus alleviating drift problems caused by distractors.

• We add an auxiliary classification branch after the feature extraction block, which trains the features to also contain semantic information. This increases the robustness of the features to variations in object appearances, and boosts the tracking performance.

• We extensively evaluate our algorithm on the large-scale OTB and VOT datasets. Our trackers perform favorably against state-of-the-art tracking methods while possessing real-time speed.

The remainder of the paper is organized as follows. In Section 2, we briefly review related work. In Section 3, we describe our proposed tracking methods, and in Section 4 we present implementation details. We perform extensive experiments on the OTB and VOT datasets in Section 5.

2 RELATED WORK

In this section, we review related work on tracking-by-detection, tracking by template-matching, memory networks and multi-task learning. A preliminary version of our work appears in ECCV 2018 [22]. This paper contains additional improvements in both methodology and experiments, including: 1) we propose a negative memory unit that stores distractor templates to cancel out wrong responses from the object template; 2) we design an auxiliary classification loss to facilitate the tracker's robustness to appearance changes; 3) we conduct comprehensive experiments on the VOT datasets, including VOT-2015, VOT-2016 and VOT-2017.

2.1 Tracking by Detection

Tracking-by-detection treats object tracking as a detection problem within an ROI image, where an online learned classifier is used to distinguish the target from the background. The difficulty of updating the classifier to adapt to appearance variations is that the bounding box predicted on each frame may not be accurate, which produces degraded training samples and thus gradually causes the tracker to drift. Numerous algorithms have been designed to mitigate the sample ambiguity caused by inaccurate predicted bounding boxes. [23] formulates the online model learning process in a semi-supervised fashion by combining a given prior and the trained classifier. [24] proposes a multiple instance learning scheme to solve the problem of inaccurate examples for online training. Instead of only focusing on facilitating the training process of the tracker, [25] decomposes the tracking task into three parts: tracking, learning and detection, where an optical flow tracker is used for frame-to-frame tracking and an online trained detector is adopted to re-detect the target when drifting occurs.


With the widespread use of CNNs in the computer vision community, many methods [26] have applied CNNs as the classifier to localize the target. [12] uses two fully convolutional neural networks to estimate the target's bounding box, including a GNet that captures category information and an SNet that classifies the target from the background. [11] presents a multi-domain learning framework to learn the shared representation of objects from different sequences. Motivated by Dropout [27], BranchOut [28] adopts multiple branches of fully connected layers, from which a random subset is selected for training, which regularizes the neural networks to avoid overfitting. Unlike these tracking-by-detection algorithms, which need costly stochastic gradient descent (SGD) updating, our method runs completely feed-forward and adapts to the object's appearance variations through a memory writing process, thus achieving real-time performance.

2.2 Tracking by Template-Matching

Matching-based methods have recently gained popularity due to their fast speed and promising performance. The most notable is the fully convolutional Siamese network (SiamFC) [13]. Although it only uses the first frame as the template, SiamFC achieves competitive results and fast speed. The key deficiency of SiamFC is that it lacks an effective model for online updating. To address this, [17] updates the model using linear interpolation of new templates with a small learning rate, but only sees modest improvements in accuracy. RFL (Recurrent Filter Learning) [19] adopts a convolutional LSTM for model updating, where the forget and input gates automatically control the linear combination of the historical target information (i.e., the memory states of the LSTM) and the object's current template. Guo et al. [14] propose a dynamic Siamese network with two general transformations for target appearance variation and background suppression. He et al. [20] design two branches of Siamese networks with a channel-wise attention mechanism aiming to improve the robustness and discrimination ability of the matching network.

To further improve the speed of SiamFC, [29] reduces the feature computation cost for easy frames by using deep reinforcement learning to train policies for early stopping the feed-forward calculations of the CNN when the response confidence is high enough. SINT [15] also uses Siamese networks for visual tracking and has higher accuracy, but runs much slower than SiamFC (2 fps vs. 86 fps) due to the use of a deeper CNN (VGG16) for feature extraction, and optical flow for its candidate sampling strategy. [30] proposes a dual deep network by exploiting hierarchical features of CNN layers for object tracking. Unlike other template-matching models that use sliding windows or random sampling to generate candidate image patches for testing, GOTURN [16] directly regresses the coordinates of the target's bounding box by comparing the previous and current image patches. Despite its fast speed and advantage in handling scale and aspect ratio changes, its tracking accuracy is much lower than that of other state-of-the-art trackers.

Different from existing matching-based trackers, where the capacity to adapt is limited by the neural network size, we use SiamFC as the baseline feature extractor and add an addressable memory, whose size is independent of the neural networks and thus can be easily enlarged as the memory requirements of a tracking task increase.

2.3 Memory Networks

The recent use of convolutional LSTM for visual tracking [19] shows that memory states are useful for object template management over long timescales. Memory networks are typically used to solve simple logical reasoning problems in natural language processing (NLP), e.g., question answering and sentiment analysis. The pioneering works include NTM (Neural Turing Machine) [31] and MemNN (Memory Neural Networks) [32]. They both propose an addressable external memory with reading and writing mechanisms – NTM focuses on problems of sorting, copying and recall, while MemNN aims at language and reasoning tasks. MemN2N [33] further improves MemNN by removing the supervision of supporting facts, which makes it trainable in an end-to-end fashion. Based on NTM, [34] proposes DNC (Differentiable Neural Computer), which uses a different access mechanism to alleviate the memory overlap and interference problems. Recently, NTM has also been applied to one-shot learning [35] by redesigning the method for reading and writing memory, and has shown promising results at encoding and retrieving new information quickly. [36] also proposes a memory-augmented tracking algorithm, which obtains limited performance and lower speed (5 fps) for two reasons. First, in contrast to our method, they perform dimensionality reduction of the object template (from 20×20×256 to 256) when storing it into memory, resulting in a loss of spatial information for template matching. Second, they extract multiple patches centered on different positions of the search image to retrieve the proper memory, which is not efficient compared with our attention scheme.

Our proposed memory model differs from the aforementioned memory networks in the following aspects. First, for the question answering problem, the input at each time step is a sentence, i.e., a sequence of feature vectors (each word corresponds to one vector) that needs an embedding layer (usually an RNN) to obtain an internal state. In contrast, for object tracking, the input is a search image that needs a feature extraction process (usually a CNN) to get a more abstract representation. Furthermore, for object tracking, the target's position in the search image patch is unknown, and here we propose an attention mechanism to highlight the target's information when generating the read key for memory retrieval. Second, the dimension of the feature vectors stored in memory for NLP is relatively small (50 in MemN2N vs. 6×6×256 = 9216 in our case), and directly using the original template for address calculation is time-consuming. Therefore we apply average pooling on the feature map to generate a template key for addressing, which is efficient and effective experimentally. Furthermore, we apply channel-wise gated residual template learning for model updating, and redesign the memory writing operation to be more suitable for visual tracking.

2.4 Multi-task Learning

Multi-task learning has been successfully used in many applications of machine learning, ranging from natural language processing [37] and speech recognition [38] to computer vision [39].


A"en%on-basedLSTMController

PosWrite

ResTemplate

InitTemplate

SearchImage St

Frame It

ObjectImage Ot

NegMemoryMnt

<latexit sha1_base64="/sVAVAXh5wvrYdvTUcS1NtxM5BY=">AAACBnicbVA9SwNBEJ2LX/H8ilqJzWEQrI47Gy0DWtgoEcwHJGfY2+wlS/b2jt05MRzpbW31P9iJrX/Dztpf4Sax0MQHA4/3ZpiZF6aCa/S8D6uwsLi0vFJctdfWNza3Sts7dZ1kirIaTUSimiHRTHDJashRsGaqGIlDwRrh4GzsN+6Y0jyRNzhMWRCTnuQRpwSN1Ljs5Di6lZ1S2XO9CZx54v+QcsU93/sEgGqn9NXuJjSLmUQqiNYt30sxyIlCTgUb2e1Ms5TQAemxlqGSxEwH+eTckXNolK4TJcqURGei/p7ISaz1MA5NZ0ywr2e9sfiv11Mk7XN6P7Mfo9Mg5zLNkEk6XR9lwsHEGSfidLliFMXQEEIVNx84tE8UoWhys23bhOPPRjFP6seu77n+tUnpCqYowj4cwBH4cAIVuIAq1IDCAB7hCZ6tB+vFerXepq0F62dmF/7Aev8G4TiayA==</latexit><latexit sha1_base64="LEGIS/cX4LRb+W0TP9PphU0gUyM=">AAACBnicbVC5TsNAEB1zBnOFo0E0FhESVWTTQBkJChpQkMghJSZab9bJKuu1tTtGRFZ6Wlr4BzpEy29Q0/EHdGyOAhKeNNLTezOamRckgmt03Q9rbn5hcWk5t2Kvrq1vbOa3tqs6ThVlFRqLWNUDopngklWQo2D1RDESBYLVgt7Z0K/dMaV5LG+wnzA/Ih3JQ04JGql22cpwcCtb+YJbdEdwZok3IYVS8XzvE3a/y638V7Md0zRiEqkgWjc8N0E/Iwo5FWxgN1PNEkJ7pMMahkoSMe1no3MHzqFR2k4YK1MSnZH6eyIjkdb9KDCdEcGunvaG4r9eR5Gky+n91H4MT/2MyyRFJul4fZgKB2NnmIjT5opRFH1DCFXcfODQLlGEosnNtm0TjjcdxSypHhc9t+hdm5SuYIwc7MMBHIEHJ1CCCyhDBSj04BGe4Nl6sF6sV+tt3DpnTWZ24A+s9x9lQ5vY</latexit><latexit sha1_base64="LEGIS/cX4LRb+W0TP9PphU0gUyM=">AAACBnicbVC5TsNAEB1zBnOFo0E0FhESVWTTQBkJChpQkMghJSZab9bJKuu1tTtGRFZ6Wlr4BzpEy29Q0/EHdGyOAhKeNNLTezOamRckgmt03Q9rbn5hcWk5t2Kvrq1vbOa3tqs6ThVlFRqLWNUDopngklWQo2D1RDESBYLVgt7Z0K/dMaV5LG+wnzA/Ih3JQ04JGql22cpwcCtb+YJbdEdwZok3IYVS8XzvE3a/y638V7Md0zRiEqkgWjc8N0E/Iwo5FWxgN1PNEkJ7pMMahkoSMe1no3MHzqFR2k4YK1MSnZH6eyIjkdb9KDCdEcGunvaG4r9eR5Gky+n91H4MT/2MyyRFJul4fZgKB2NnmIjT5opRFH1DCFXcfODQLlGEosnNtm0TjjcdxSypHhc9t+hdm5SuYIwc7MMBHIEHJ1CCCyhDBSj04BGe4Nl6sF6sV+tt3DpnTWZ24A+s9x9lQ5vY</latexit><latexit sha1_base64="LEGIS/cX4LRb+W0TP9PphU0gUyM=">AAACBnicbVC5TsNAEB1zBnOFo0E0FhESVWTTQBkJChpQkMghJSZab9bJKuu1tTtGRFZ6Wlr4BzpEy29Q0/EHdGyOAhKeNNLTezOamRckgmt03Q9rbn5hcWk5t2Kvrq1vbOa3tqs6ThVlFRqLWNUDopngklWQo2D1RDESBYLVgt7Z0K/dMaV5LG+wnzA/Ih3JQ04JGql22cpwcCtb+YJbdEdwZok3IYVS8XzvE3a/y638V7Md0zRiEqkgWjc8N0E/Iwo5FWxgN1PNEkJ7pMMahkoSMe1no3MHzqFR2k4YK1MSnZH6eyIjkdb9KDCdEcGunvaG4r9eR5Gky+n91H4MT/2MyyRFJul4fZgKB2NnmIjT5opRFH1DCFXcfODQLlGEosnNtm0TjjcdxSypHhc9t+hdm5SuYIwc7MMBHIEHJ1CCCyhDBSj04BGe4Nl6sF6sV+tt3DpnTWZ24A+s9x9lQ5vY</latexit><latexit sha1_base64="t/77VNgEbC1Hds5lKspTH3pn0bE=">AAACBnicbVA9SwNBEJ2LX/H8ilraLAbBKtzZaBmwsVEimA9IzrC32SRL9vaO3TkxHOltbfU/2Imtf8O/4K9wk1yhiQ8GHu/NMDMvTKQw6HlfTmFldW19o7jpbm3v7O6V9g8aJk4143UWy1i3Qmq4FIrXUaDkrURzGoWSN8PR5dRvPnBtRKzucJzwIKIDJfqCUbRS87qb4eRedUtlr+LNQJaJn5My5Kh1S9+dXszSiCtkkhrT9r0Eg4xqFEzyidtJDU8oG9EBb1uqaMRNkM3OnZATq/RIP9a2FJKZ+nsio5Ex4yi0nRHFoVn0puK/3kDTZCjY48J+7F8EmVBJilyx+fp+KgnGZJoI6QnNGcqxJZRpYT8gbEg1ZWhzc13XhuMvRrFMGmcV36v4t165epPHVIQjOIZT8OEcqnAFNagDgxE8wwu8Ok/Om/PufMxbC04+cwh/4Hz+AHkfmR0=</latexit>

PosReadKey PosMemKeys

Read

NegReadKey NegMemKeys

Read

PosTemplate

NegTemplate

xResidualGate

TemplateCanceling FinalTemplate

PosMemoryMpt

<latexit sha1_base64="zoUh/opZ3phRtf6xDuv03AK0SL8=">AAACBnicbVC7TgMxENwLr3C8ApQ0hgiJ6nRHA2UkGhpQkMhDSo7I5/gSKz6fZfsQ0Sk9LS38Ax2i5Tf4BQq+AedRQMJIK41mdrW7E0nOtPH9T6ewtLyyulZcdzc2t7Z3Srt7dZ1mitAaSXmqmhHWlDNBa4YZTptSUZxEnDaiwcXYb9xTpVkqbs1Q0jDBPcFiRrCxUuOqk5vRneyUyr7nT4AWSTAj5Yp3+I0AoNopfbW7KckSKgzhWOtW4EsT5lgZRjgdue1MU4nJAPdoy1KBE6rDfHLuCB1bpYviVNkSBk3U3xM5TrQeJpHtTLDp63lvLP7r9RSWfUYe5vab+DzMmZCZoYJM18cZRyZF40RQlylKDB9agoli9gNE+lhhYmxuruvacIL5KBZJ/dQLfC+4sSldwxRFOIAjOIEAzqACl1CFGhAYwBM8w4vz6Lw6b877tLXgzGb24Q+cjx/Q35q8</latexit><latexit sha1_base64="Rky9skAxX8Uw94tzf3REsIsFimE=">AAACBnicbVC7SgNBFJ2Nr7i+opY2o1GwWnYF0TJgY6NEMA9IYpidzCZDZmeHmbtiWNLb2uo/2Imtv+EvWPgNTh6FJh64cDjnXu69J1SCG/D9Tye3sLi0vJJfddfWNza3Cts7VZOkmrIKTUSi6yExTHDJKsBBsLrSjMShYLWwfzHya/dMG57IWxgo1opJV/KIUwJWql21MxjeqXah6Hv+GHieBFNSLHn734enXJbbha9mJ6FpzCRQQYxpBL6CVkY0cCrY0G2mhilC+6TLGpZKEjPTysbnDvGRVTo4SrQtCXis/p7ISGzMIA5tZ0ygZ2a9kfiv19VE9Th9mNkP0Xkr41KlwCSdrI9SgSHBo0Rwh2tGQQwsIVRz+wGmPaIJBZub67o2nGA2inlSPfEC3wtubErXaII82kMH6BgF6AyV0CUqowqiqI+e0DN6cR6dV+fNeZ+05pzpzC76A+fjB1Wsm8w=</latexit><latexit sha1_base64="Rky9skAxX8Uw94tzf3REsIsFimE=">AAACBnicbVC7SgNBFJ2Nr7i+opY2o1GwWnYF0TJgY6NEMA9IYpidzCZDZmeHmbtiWNLb2uo/2Imtv+EvWPgNTh6FJh64cDjnXu69J1SCG/D9Tye3sLi0vJJfddfWNza3Cts7VZOkmrIKTUSi6yExTHDJKsBBsLrSjMShYLWwfzHya/dMG57IWxgo1opJV/KIUwJWql21MxjeqXah6Hv+GHieBFNSLHn734enXJbbha9mJ6FpzCRQQYxpBL6CVkY0cCrY0G2mhilC+6TLGpZKEjPTysbnDvGRVTo4SrQtCXis/p7ISGzMIA5tZ0ygZ2a9kfiv19VE9Th9mNkP0Xkr41KlwCSdrI9SgSHBo0Rwh2tGQQwsIVRz+wGmPaIJBZub67o2nGA2inlSPfEC3wtubErXaII82kMH6BgF6AyV0CUqowqiqI+e0DN6cR6dV+fNeZ+05pzpzC76A+fjB1Wsm8w=</latexit><latexit sha1_base64="Rky9skAxX8Uw94tzf3REsIsFimE=">AAACBnicbVC7SgNBFJ2Nr7i+opY2o1GwWnYF0TJgY6NEMA9IYpidzCZDZmeHmbtiWNLb2uo/2Imtv+EvWPgNTh6FJh64cDjnXu69J1SCG/D9Tye3sLi0vJJfddfWNza3Cts7VZOkmrIKTUSi6yExTHDJKsBBsLrSjMShYLWwfzHya/dMG57IWxgo1opJV/KIUwJWql21MxjeqXah6Hv+GHieBFNSLHn734enXJbbha9mJ6FpzCRQQYxpBL6CVkY0cCrY0G2mhilC+6TLGpZKEjPTysbnDvGRVTo4SrQtCXis/p7ISGzMIA5tZ0ygZ2a9kfiv19VE9Th9mNkP0Xkr41KlwCSdrI9SgSHBo0Rwh2tGQQwsIVRz+wGmPaIJBZub67o2nGA2inlSPfEC3wtubErXaII82kMH6BgF6AyV0CUqowqiqI+e0DN6cR6dV+fNeZ+05pzpzC76A+fjB1Wsm8w=</latexit><latexit sha1_base64="seAPtVuGDkE8go5Ii34+jO6anqo=">AAACBnicbVA9SwNBEJ2LX/H8ilraHAbBKtzZaBmwsVEimA9IzrC32SRL9vaW3TkxHOltbfU/2Imtf8O/4K9wk1yhiQ8GHu/NMDMvUoIb9P0vp7Cyura+Udx0t7Z3dvdK+wcNk6SasjpNRKJbETFMcMnqyFGwltKMxJFgzWh0OfWbD0wbnsg7HCsWxmQgeZ9TglZqXncznNyrbqnsV/wZvGUS5KQMOWrd0nenl9A0ZhKpIMa0A19hmBGNnAo2cTupYYrQERmwtqWSxMyE2ezciXdilZ7XT7Qtid5M/T2RkdiYcRzZzpjg0Cx6U/Ffb6CJGnL6uLAf+xdhxqVKkUk6X99PhYeJN03E63HNKIqxJYRqbj/w6JBoQtHm5rquDSdYjGKZNM4qgV8Jbv1y9SaPqQhHcAynEMA5VOEKalAHCiN4hhd4dZ6cN+fd+Zi3Fpx85hD+wPn8AXxXmR8=</latexit>

+

NegWriteNegTemplateExtrac%on

ResponseMap

SearchFeature

Posi%veMemoryReading

Nega%veMemoryReading Nega%veMemoryWri%ng

Posi%veMemoryWri%ng

ConvLayers

ConvLayers

SearchImage St

Fig. 2. The pipeline of our tracking algorithm. The green rectangle is the candidate region for target searching. The Feature Extraction blocks for theobject image and search image share the same architecture and parameters. An attentional LSTM extracts the target’s information on the searchfeature map, which guides the memory reading process to retrieve a matching template. The residual template is combined with the initial template,to obtain a positive template. A negative template is read from the negative memory and combined with the positive template to cancel responsesfrom distractor objects. The final template is convolved with the search feature map to obtain the response map. The newly predicted bounding boxis then used to crop the object’s feature map for writing to the positive memory. A negative template is extracted from the search feature map usingthe response score and written to negative memory.

[40] estimates the street direction in an autonomous driving car by predicting various characteristics of the road, which serves as an auxiliary task. [41] introduces auxiliary tasks of estimating head pose and facial attributes to boost the performance of facial landmark detection, while [42] boosts the performance of a human pose estimation network by adding human joint detectors as auxiliary tasks. Recent works combining object detection and semantic segmentation [43], [44], as well as image depth estimation and semantic segmentation [45], [46], also demonstrate the effectiveness of multi-task learning in improving the generalization ability of neural networks. Observing that the CNN learned for object similarity matching lacks invariance to appearance variations, we propose to add an auxiliary task, object classification, to regularize the CNN so that it learns object semantics.

3 DYNAMIC MEMORY NETWORKS FOR TRACKING

In this section, we propose a dynamic memory network with reading and writing mechanisms for visual tracking. The whole framework is shown in Figure 2. Given the search image, features are first extracted with a CNN. The image features are input into an attentional LSTM, which controls memory reading and writing. A residual template is read from the positive memory and combined with the initial template learned from the first frame, forming the positive template. Then a negative template is retrieved from the negative memory to cancel parts of the positive template through a channel-wise gate, forming the final template. The final template is convolved with the search image features to obtain the response map, and the target bounding box is predicted. The new target's template is cropped using the predicted bounding box, features are extracted and then written into the positive memory for model updating. The negative template is extracted from the search feature map based on the response map: responses whose corresponding score is greater than a threshold and that are far from the target's center are considered as negative (distractor) templates for negative memory writing.

3.1 Feature Extraction

Given an input image $I_t$ at time $t$, we first crop the frame into a search image patch $S_t$ using a rectangle that is computed from the previously predicted bounding box, as in [13]. The patch is then encoded into a high-level representation $f(S_t)$, which is a spatial feature map, via a fully convolutional neural network (FCNN). In this work we use the FCNN structure from SiamFC [13]. After getting the predicted bounding box, we use the same feature extractor to compute the new object template for positive memory writing.
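As a concrete illustration, the following PyTorch sketch implements a SiamFC-style fully convolutional feature extractor shared between the object and search images. The specific layer sizes and the use of batch normalization are assumptions of this sketch, not a statement of the exact backbone of [13]; the point is that the same network, with shared weights, encodes both patches.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Illustrative SiamFC-style fully convolutional backbone f(.).

    Layer sizes are assumptions of this sketch; the tracker shares this
    network between the object image O_t and the search image S_t.
    """
    def __init__(self, out_channels=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=2), nn.BatchNorm2d(96), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5), nn.BatchNorm2d(256), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3), nn.BatchNorm2d(384), nn.ReLU(),
            nn.Conv2d(384, 384, kernel_size=3), nn.BatchNorm2d(384), nn.ReLU(),
            nn.Conv2d(384, out_channels, kernel_size=3),
        )

    def forward(self, x):
        # x: (B, 3, H, W) image patch -> (B, c, h, w) spatial feature map
        return self.features(x)

# The same extractor (shared weights) encodes both the search patch S_t
# and the newly cropped object patch used for positive memory writing.
extractor = FeatureExtractor()
search_feat = extractor(torch.randn(1, 3, 255, 255))   # F_t = f(S_t)
object_feat = extractor(torch.randn(1, 3, 127, 127))   # new object template
```

With the sizes assumed here, a 127×127 object patch yields a 6×6×256 template, matching the template size quoted in the footnote of Section 3.2.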

3.2 Attention Scheme

Since the object information in the search image is needed to retrieve the related template for matching, but the object location is at first unknown, we apply an attention mechanism to make the input to the LSTM concentrate more on the target.


Fig. 3. Left: Diagram of the attention network: the previous hidden state $h_{t-1}$ and the $i$-th pooled feature vector $f_{t,i}$ are combined through a tanh layer and a softmax to produce the attention weights $\alpha_{t,i}$ and the attended vector $a_t$. Right: Visualization of attentional weight maps: for each pair, (top) search images and ground-truth target box, and (bottom) attention maps over the search image. For visualization, the attention maps are resized using bicubic interpolation to match the size of the original image.

We define $F_{t,i} \in \mathbb{R}^{n \times n \times c}$ as the $i$-th $n \times n \times c$ square patch on $F_t = f(S_t)$, extracted in a sliding-window fashion.¹ Each square patch covers a certain part of the search image. An attention-based weighted sum of these square patches can be regarded as a soft representation of the object, which can then be fed into the LSTM to generate a proper read key for memory retrieval. However, the size of this soft feature map is still too large to feed directly into the LSTM. To further reduce the size of each square patch, we first apply average pooling with an $n \times n$ filter size on $F_t$,

$f_t = \mathrm{AvgPooling}_{n \times n}(F_t)$ (1)

where $f_{t,i} \in \mathbb{R}^c$ is the feature vector for the $i$-th patch. The attended feature vector is then computed as the weighted sum of the feature vectors,

$a_t = \sum_{i=1}^{L} \alpha_{t,i} f_{t,i}$ (2)

where $L$ is the number of square patches, and the attention weights $\alpha_{t,i}$ are calculated by a softmax,

$\alpha_{t,i} = \dfrac{\exp(r_{t,i})}{\sum_{k=1}^{L} \exp(r_{t,k})}$ (3)

where

$r_{t,i} = W^a \tanh(W^h h_{t-1} + W^f f_{t,i} + b)$ (4)

is an attention network (Figure 3: Left), which takes the previous hidden state $h_{t-1}$ of the LSTM and a square patch $f_{t,i}$ as input. $W^a$, $W^h$, $W^f$ and $b$ are the weight matrices and biases of the network.

By comparing the target's historical information in the previous hidden state with each square patch, the attention network can generate attentional weights that have higher values on the target and smaller values for surrounding regions. Figure 3 (right) shows example search images with attention weight maps. Our attention network consistently focuses on the target, which is beneficial when retrieving memory for template matching.

1. We use 6×6×256, which is the same size as the matching template.
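The PyTorch sketch below implements Eqs. (1)-(4): the search feature map is average-pooled patch-wise, each pooled vector is scored against the previous LSTM hidden state, and the scores are normalized into the attended vector $a_t$. Tensor shapes and the hidden size are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Sketch of the attention scheme of Eqs. (1)-(4); symbol names follow
    the text, but layer dimensions are illustrative assumptions."""
    def __init__(self, c=256, hidden=512):
        super().__init__()
        self.W_h = nn.Linear(hidden, hidden)   # applied to previous state h_{t-1}
        self.W_f = nn.Linear(c, hidden)        # applied to pooled patch vector f_{t,i}
        self.W_a = nn.Linear(hidden, 1)        # scalar attention score r_{t,i}

    def forward(self, F_t, h_prev, n=6):
        # F_t: (1, c, H, W) search feature map; pool each n x n patch, Eq. (1)
        f = F.avg_pool2d(F_t, kernel_size=n, stride=1)        # (1, c, H-n+1, W-n+1)
        B, c, ph, pw = f.shape
        f = f.view(B, c, ph * pw).transpose(1, 2)              # (1, L, c), L patches
        # r_{t,i} = W^a tanh(W^h h_{t-1} + W^f f_{t,i} + b), Eq. (4)
        r = self.W_a(torch.tanh(self.W_h(h_prev).unsqueeze(1) + self.W_f(f)))  # (1, L, 1)
        alpha = torch.softmax(r, dim=1)                        # Eq. (3)
        a_t = (alpha * f).sum(dim=1)                           # Eq. (2): (1, c)
        return a_t, alpha.view(B, ph, pw)

# toy usage: 22x22x256 search features and a 512-d previous hidden state
att = SpatialAttention()
a_t, alpha = att(torch.randn(1, 256, 22, 22), torch.randn(1, 512))
```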

3.3 LSTM Memory Controller

At each time step, the LSTM controller takes the attended feature vector $a_t$, obtained by the attention module, and the previous hidden state $h_{t-1}$ as input, and outputs the new hidden state $h_t$, which is used to calculate the memory control signals, including the read key, read strength, bias gates, and decay rate (discussed later). The internal architecture of the LSTM is the standard model, while the output layer is modified to generate the control signals. In addition, we also use layer normalization [47] and dropout regularization [27] for the LSTM. The initial hidden state $h_0$ and cell state $c_0$ are obtained by passing the initial target's feature map through an $n \times n$ average pooling layer and two separate fully-connected layers with tanh activation functions, respectively.
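A minimal sketch of the controller follows: a standard LSTM cell with small output heads that produce the control signals defined later (read key $k_t$ in Eq. (5), read strength $\beta_t$ in Eq. (6), residual gate $r_t$ in Eq. (10)). Layer normalization, dropout and the write-related signals are omitted, and the hidden size is an assumption of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryController(nn.Module):
    """Sketch of the LSTM controller (Section 3.3) and a few of its heads."""
    def __init__(self, c=256, hidden=512):
        super().__init__()
        self.cell = nn.LSTMCell(c, hidden)
        self.key_head = nn.Linear(hidden, c)        # read key k_t
        self.strength_head = nn.Linear(hidden, 1)   # read strength beta_t
        self.gate_head = nn.Linear(hidden, c)       # residual gate r_t

    def forward(self, a_t, state):
        h, cst = self.cell(a_t, state)                        # standard LSTM update
        k_t = self.key_head(h)                                # Eq. (5)
        beta_t = 1.0 + F.softplus(self.strength_head(h))      # Eq. (6)
        r_t = torch.sigmoid(self.gate_head(h))                # Eq. (10)
        return k_t, beta_t, r_t, (h, cst)

# toy usage with zero-initialized states (the paper derives h_0, c_0 from
# the initial target feature map instead)
ctrl = MemoryController()
h0, c0 = torch.zeros(1, 512), torch.zeros(1, 512)
k_t, beta_t, r_t, state = ctrl(torch.randn(1, 256), (h0, c0))
```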

3.4 Memory Reading

Memory is retrieved by computing a weighted sum of all memory slots with the read weight vector, which is determined by the cosine similarity between the read key and the memory keys. This aims at retrieving the most related template stored in memory. Since the memory reading processes for the positive and negative memories are similar, we only describe the positive case. Suppose $M_t \in \mathbb{R}^{N \times n \times n \times c}$ represents the memory module, such that $M_t(j) \in \mathbb{R}^{n \times n \times c}$ is the template stored in the $j$-th memory slot and $N$ is the number of memory slots. The LSTM controller outputs the read key $k_t \in \mathbb{R}^c$ and the read strength $\beta_t \in [1, \infty)$,

$k_t = W^k h_t + b^k$, (5)

$\beta_t = 1 + \log(1 + \exp(W^{\beta} h_t + b^{\beta}))$, (6)

where $W^k, W^{\beta}, b^k, b^{\beta}$ are the weight matrices and biases. The read key $k_t$ is used for matching the contents in memory, while the read strength $\beta_t$ indicates the reliability of the generated read key. Given the read key and read strength, a read weight $w^r_t \in \mathbb{R}^N$ is computed for memory retrieval,

$w^r_t(j) = \dfrac{\exp\{C(k_t, k_{M_t(j)})\,\beta_t\}}{\sum_{j'} \exp\{C(k_t, k_{M_t(j')})\,\beta_t\}}$, (7)

where $k_{M_t(j)} \in \mathbb{R}^c$ is the memory key generated by an $n \times n$ average pooling on $M_t(j)$, and $C(x, y)$ is the cosine similarity between vectors, $C(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|}$. Finally, the template is retrieved from memory as a weighted sum,

$T^{\mathrm{retr}}_t = \sum_{j=1}^{N} w^r_t(j)\, M_t(j)$. (8)
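The sketch below implements the reading operation of Eqs. (7)-(8), assuming the memory is kept as a single tensor of N slots; the memory keys are obtained by average pooling each slot, as in the text.

```python
import torch
import torch.nn.functional as F

def read_memory(memory, k_t, beta_t):
    """Sketch of positive-memory reading, Eqs. (7)-(8).

    memory: (N, c, n, n) tensor of stored templates M_t(j)
    k_t:    (c,) read key from the controller, Eq. (5)
    beta_t: scalar read strength >= 1, Eq. (6)
    """
    N = memory.shape[0]
    # memory keys: n x n average pooling of each slot -> (N, c)
    mem_keys = memory.mean(dim=(2, 3))
    # cosine similarity C(k_t, k_{M_t(j)}) for every slot
    sim = F.cosine_similarity(mem_keys, k_t.unsqueeze(0), dim=1)   # (N,)
    # read weights, Eq. (7): softmax of similarity sharpened by beta_t
    w_r = torch.softmax(beta_t * sim, dim=0)                        # (N,)
    # retrieved template, Eq. (8): weighted sum of memory slots
    T_retr = (w_r.view(N, 1, 1, 1) * memory).sum(dim=0)             # (c, n, n)
    return T_retr, w_r

# toy usage with N=8 slots of 6x6x256 templates (sizes are assumptions)
mem = torch.randn(8, 256, 6, 6)
T_retr, w_r = read_memory(mem, torch.randn(256), beta_t=torch.tensor(2.0))
```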

3.5 Residual Template Learning

Directly using the retrieved template for similarity matching is prone to overfitting recent frames. Instead, we learn a residual template by multiplying the retrieved template with a channel-wise gate vector and adding it to the initial template to capture appearance changes. Therefore, our positive template is formulated as

$T^{\mathrm{pos}}_t = T_0 + r_t \odot T^{\mathrm{retr}}_t$, (9)

where $T_0$ is the initial template and $\odot$ is channel-wise multiplication. $r_t \in \mathbb{R}^c$ is the residual gate produced by the LSTM controller,

$r_t = \sigma(W^r h_t + b^r)$, (10)


Fig. 4. Left: The feature channels respond to target parts: images are reconstructed from conv5 of the CNN used in our tracker. Each image is generated by accumulating reconstructed pixels from the same channel. The input image is shown in the top-left. Right: Channel visualizations of a retrieved template along with their corresponding residual gate values in the top-left corner.

where W^r, b^r are the weights and bias, and σ is the sigmoid function. The residual gate controls how much each channel of the retrieved template is added to the initial template, which can be regarded as a form of feature selection.

By projecting different channels of a target feature map to pixel-space using deconvolution, as in [48], we find that the channels focus on different object parts (Figure 4, left). To show the behavior of residual learning, we also visualize the retrieved template along with its residual gates in Figure 4 (right). The channels that correspond to regions with appearance changes (the bottom part of the face is occluded) have higher residual gate values, demonstrating that the residual learning scheme adapts the initial template to appearance variations. In addition, channels corresponding to previous target appearances are also retrieved from memory (e.g., 6th row, 5th column; the nose and mouth are both visible, meaning that they are not occluded), demonstrating that our residual learning does not overfit to recent frames. Thus, channel-wise feature residual learning has the advantage of updating different object parts separately. Experiments in Section 5.3 show that this yields a large performance improvement.
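The gated residual combination of Eqs. (9)-(10) amounts to a per-channel blend of the fixed initial template and the retrieved template. A short NumPy sketch (shapes and names are assumptions) of this step:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def positive_template(T0, T_retr, h_t, W_r, b_r):
    # T0, T_retr: templates of shape (n, n, c); h_t: LSTM hidden state
    r_t = sigmoid(W_r @ h_t + b_r)             # residual gate, one value per channel, Eq. (10)
    return T0 + r_t[None, None, :] * T_retr    # channel-wise gated addition, Eq. (9)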

3.6 Distractor Template Canceling and Final Template

As shown in Section 3.5, the feature channels respond to different object parts. The channels of the positive template that are similar to those of a distractor template are considered not discriminative. Thus we propose to cancel these feature channels of the positive template via a canceling gate to obtain the final template,

T^{final}_t = T^{pos}_t - c_t \odot T^{neg}_t,   (11)

where T^{neg}_t is the distractor (negative) template, which is retrieved from the negative memory (as in Section 3.4), and c_t is the canceling gate produced by comparing the positive and negative templates,

c_t = \sigma(W^c \tanh(W^{pos}_{1\times1\times c} * T^{pos}_t + W^{neg}_{1\times1\times c} * T^{neg}_t + b^c)),   (12)

where W^{pos}, W^{neg} are 1×1×c convolution filters, {W^c, b^c} are the weights and bias, and ∗ is the convolution operation. This process weakens the weight of non-discriminative channels when forming the final response map, leading to an emphasis on discerning channels.
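One plausible reading of Eqs. (11)-(12), sketched in NumPy for illustration: the 1×1×c convolutions act as channel projections applied at every spatial location, and the result is pooled before producing a per-channel canceling gate (the pooling step is our assumption, as are all names and shapes below).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cancel_template(T_pos, T_neg, W_pos, W_neg, W_c, b_c):
    # T_pos, T_neg: (n, n, c); W_pos, W_neg, W_c: (c, c); b_c: (c,)
    z = np.tanh(T_pos @ W_pos + T_neg @ W_neg)       # 1x1 conv == matrix multiply along channels
    c_t = sigmoid(z.mean(axis=(0, 1)) @ W_c + b_c)   # per-channel canceling gate, cf. Eq. (12)
    return T_pos - c_t[None, None, :] * T_neg        # final template, Eq. (11)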


Fig. 5. Example responses comparing tracking with distractor template canceling (MemDTC) and without (MemTrack).

To demonstrate the effect of distractor template canceling, we show the responses generated with distractor template canceling (MemDTC) and without it (MemTrack) in Figure 5. The response maps generated by MemDTC are less cluttered than those of MemTrack, and MemDTC effectively suppresses responses from similar-looking distractors (i.e., the other runners).

The response map is generated by convolving the search feature map with the final template as the filter, which is equivalent to calculating the correlation score between the template and each translated sub-window of the search feature map in a sliding-window fashion. The displacement of the target from the last frame to the current frame is calculated by multiplying the position of the maximum score, relative to the center of the response map, with the feature stride. The size of the bounding box is determined by searching at multiple scales of the feature map. This is done in a single feed-forward computation by assembling all scaled images as a mini-batch, which is very efficient in modern deep learning libraries.
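The localization step described above can be sketched as follows (NumPy/SciPy, illustrative only; names are assumptions): cross-correlate the search feature map with the final template, find the peak offset from the response-map center, and scale it by the feature stride.

import numpy as np
from scipy.signal import correlate

def locate(search_feat, template, stride):
    # search_feat: (H, W, c); template: (n, n, c)
    resp = correlate(search_feat, template, mode='valid')[:, :, 0]   # response map
    peak = np.unravel_index(resp.argmax(), resp.shape)
    cy, cx = (resp.shape[0] - 1) / 2.0, (resp.shape[1] - 1) / 2.0
    dy, dx = (peak[0] - cy) * stride, (peak[1] - cx) * stride        # displacement in pixels
    return dx, dy, resp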

3.7 Positive Memory Writing

The image patch at the new position of the target is used for positive memory writing. The new object template T^{new}_t is computed using the feature extraction CNN. There are three cases for memory writing: 1) when the new object template is not reliable (e.g., it contains a lot of background), there is no need to write new information into memory; 2) when the new object appearance does not change much compared with the previous frame, the memory slot that was previously read should be updated; 3) when the new target has a large appearance change, a new memory slot should be overwritten. To handle these three cases, we define the write weight as

w^w_t = g^s \cdot 0 + g^r w^r_t + g^a w^a_t,   (13)

where 0 is the zero vector, w^r_t is the read weight, and w^a_t is the allocation weight, which is responsible for allocating a new position for memory writing. The skip gate g^s, read gate g^r, and allocation gate g^a are produced by the LSTM controller with a softmax function,

[g^s, g^r, g^a] = \mathrm{softmax}(W^g h_t + b^g),   (14)

where W^g, b^g are the weights and biases. Since g^s + g^r + g^a = 1, these three gates govern the interpolation between the three cases. If g^s = 1, then w^w_t = 0 and nothing is written. If g^r or g^a has a higher value, then the new template is either used to update the old template (using w^r_t) or written into a newly allocated position (using w^a_t).



Fig. 6. Diagram of the positive memory access mechanism, including the reading and writing process.

The allocation weight is calculated by

w^a_t(j) = \begin{cases} 1, & \text{if } j = \arg\min_j w^u_{t-1}(j) \\ 0, & \text{otherwise} \end{cases}   (15)

where w^u_t is the access vector,

w^u_t = \lambda w^u_{t-1} + w^r_t + w^w_t,   (16)

which indicates the frequency of memory access (both reading and writing), and λ is a decay factor. Memory slots that are accessed infrequently will be assigned new templates. As shown in Figure 1, our memory network is able to learn the appropriate behavior for effectively updating or allocating new templates to handle appearance variations.
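A minimal NumPy sketch of Eqs. (15)-(16) (names are assumptions): the allocation weight is a one-hot vector at the least recently accessed slot, and the access vector decays over time while accumulating read and write activity.

import numpy as np

def allocation_weight(w_u_prev):
    w_a = np.zeros_like(w_u_prev)
    w_a[np.argmin(w_u_prev)] = 1.0        # Eq. (15): pick the least-used slot
    return w_a

def update_access_vector(w_u_prev, w_r, w_w, lam=0.99):
    return lam * w_u_prev + w_r + w_w     # Eq. (16)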

The writing process is performed with a write weight in conjunction with an erase factor for clearing the memory,

M^p_{t+1}(j) = M^p_t(j)(1 - w^w_t(j) e^w) + w^w_t(j) e^w T^{new}_t,   (17)

where e^w is the erase factor computed by

e^w = d^r g^r + g^a,   (18)

and d^r ∈ [0, 1] is the decay rate produced by the LSTM controller,

d^r = \sigma(W^d h_t + b^d),   (19)

where W^d, b^d are the weights and bias. If g^r = 1 (and thus g^a = 0), then d^r serves as the decay rate for updating the template in the memory slot (Case 2). If g^a = 1 (and g^r = 0), d^r has no effect on e^w, and thus the memory slot will be erased before writing the new template (Case 3). Figure 6 shows the detailed diagram of the positive memory reading and writing process.
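Putting Eqs. (13) and (17)-(18) together, a single positive-memory write step could look like the following NumPy sketch; the gates and decay rate are passed in directly here, whereas in the tracker they come from the LSTM controller.

import numpy as np

def write_positive_memory(M, T_new, w_r, w_a, g_s, g_r, g_a, d_r):
    # M: (N, n, n, c) memory; T_new: (n, n, c) new template
    w_w = g_s * 0.0 + g_r * w_r + g_a * w_a          # write weight, Eq. (13)
    e_w = d_r * g_r + g_a                            # erase factor, Eq. (18)
    for j in range(M.shape[0]):                      # per-slot erase-and-write, Eq. (17)
        M[j] = M[j] * (1.0 - w_w[j] * e_w) + w_w[j] * e_w * T_new
    return M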

3.8 Negative Memory Writing

For the negative memory, the distractor templates for memory writing are extracted from the search feature map based on their response scores. Patches with a high response peak that are far away from the target are considered distractor templates. Following the notation of Section 3.2, F_{t,i} is the i-th n × n × c square patch on the search feature map F_t. The set of distractor templates is defined as the top-K templates (based on response score),

S^{dis} = \{ F_{t,i} \mid D(i, *) > \tau, \; R(F_{t,i}) > \gamma R(F_{t,*}), \; F_{t,i} \in \arg\max^K_{F_{t,i}} R(F_{t,i}) \},   (20)

where R(F_{t,i}) is the response score for F_{t,i}, D(i, ∗) is the spatial distance between the centers of the two templates, and F_{t,∗} is the template with the maximum response score. The operator argmax^K returns the set with the top-K values. τ is a distance threshold, and γ is a score ratio threshold. Note that if there are no distractor templates satisfying the above criterion, a zero-filled template will be written into the negative memory, which has no effect on forming the final template.
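The selection rule of Eq. (20) keeps at most K off-target patches whose response exceeds a fraction γ of the peak and whose distance from the peak exceeds τ. An illustrative sketch (names and the list-based interface are assumptions; the default thresholds follow Section 4):

import numpy as np

def select_distractors(patches, scores, centers, K=2, tau=4.0, gamma=0.7):
    # patches: list of (n, n, c) candidate templates; scores: their response scores;
    # centers: (x, y) positions of the candidates on the response map
    best = int(np.argmax(scores))
    cand = [i for i in range(len(patches))
            if np.linalg.norm(np.subtract(centers[i], centers[best])) > tau
            and scores[i] > gamma * scores[best]]
    cand.sort(key=lambda i: scores[i], reverse=True)
    return [patches[i] for i in cand[:K]]            # top-K distractor templates, Eq. (20)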

As the distractor template is usually temporary and changes frequently, we simplify the memory writing process by only using an allocation weight as the write weight. Thus the negative memory writing is similar to a memory queue,

M^n_{t+1}(j) = \sum_{k}^{K} \left[ M^n_t(j)(1 - w^{na}_{t,k}(j)) + w^{na}_{t,k}(j) T^{dis}_{t,k} \right],   (21)

where T^{dis}_{t,k} ∈ S^{dis} stands for the k-th distractor template selected based on response score, and w^{na}_{t,k} is the allocation weight for the negative memory, which is calculated by

w^{na}_{t,k}(j) = \begin{cases} 1, & \text{if } j = p(k), \; p(k) \in S^{alloc} \\ 0, & \text{otherwise} \end{cases}   (22)

where S^{alloc} = \arg\min^K_j w^u_{t-1}(j) represents the top-K newly allocated memory positions.

3.9 Auxiliary Classification Loss

As stated in [49], robust visual tracking needs both fine-grained details to accurately localize the target and semantic information to be robust to appearance variations caused by deformation or occlusion. Features learned with similarity matching are mainly focused on precise localization of the target. Thus, we propose to add an auxiliary classification branch after the last layer of the CNN feature extractor to guide the network to learn complementary semantic information. The classification branch contains a fully-connected layer with 1024 neurons and ReLU activations, followed by a fully-connected layer with 30 neurons (there are 30 categories in the training data) with a softmax function. The final loss for optimization is formed by two parts, a matching loss and a classification loss,

L(R, R^*, p, p^*) = L_{mch}(R, R^*) + \kappa L_{cls}(p, p^*),   (23)

where R, R^* are the predicted response and ground-truth response, and p, p^* are the predicted probability and the ground-truth class of the object. L_{mch} is an element-wise sigmoid cross-entropy loss as in [13],

L_{mch}(R, R^*) = \frac{1}{|D|} \sum_{u \in D} \ell(R_u, R^*_u),   (24)

where D is the set of positions in the score map, and ℓ(·) is the sigmoid cross-entropy loss. L_{cls} is the softmax cross-entropy loss, and κ is a balancing factor between the two losses. Note that the classification branch is removed during testing.
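A compact NumPy sketch of the training objective in Eqs. (23)-(24) (illustrative; the label encodings are assumptions): an element-wise sigmoid cross entropy over the score map plus a κ-weighted softmax cross entropy for the auxiliary class label.

import numpy as np

def sigmoid_ce(logits, labels):
    # numerically stable element-wise sigmoid cross entropy
    return np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))

def total_loss(resp_logits, resp_labels, cls_logits, cls_label, kappa=0.05):
    L_mch = sigmoid_ce(resp_logits, resp_labels).mean()     # matching loss, Eq. (24)
    z = cls_logits - cls_logits.max()
    L_cls = -(z - np.log(np.exp(z).sum()))[cls_label]       # softmax cross entropy
    return L_mch + kappa * L_cls                            # total loss, Eq. (23)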


4 IMPLEMENTATION DETAILS

We adopt an AlexNet-like CNN as in SiamFC [13] for feature extraction, where the input image sizes of the object and search images are 127×127×3 and 255×255×3, respectively. We use the same strategy for cropping the search and object images as [13], where some context margin around the target is added when cropping the object image. Specifically, given the newly predicted bounding box {x_t, y_t, w_t, h_t} (center x, center y, width, height) in frame t, the cropping ROI for the object image patch is calculated by

x^o_t = x_t, \quad y^o_t = y_t, \quad w^o_t = h^o_t = \sqrt{(c + w_t)(c + h_t)},   (25)

where c = δ(w_t + h_t) is the context length and δ = 0.5 is the context factor. For frame t + 1, the ROI for cropping the search image patch is computed by

x^s_{t+1} = x_t, \quad y^s_{t+1} = y_t, \quad w^s_{t+1} = h^s_{t+1} = \frac{255 - 127}{127} w^o_t + w^o_t.   (26)

Note that the cropped object image patch and search image patch are then resized to 127×127×3 and 255×255×3, respectively, to match the network input.
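The ROI computations of Eqs. (25)-(26) reduce to a few lines; the following sketch assumes the box is given as (center x, center y, width, height).

import numpy as np

def object_roi(cx, cy, w, h, delta=0.5):
    c = delta * (w + h)                                    # context length
    size = np.sqrt((c + w) * (c + h))                      # Eq. (25)
    return cx, cy, size, size

def search_roi(cx, cy, obj_size):
    size = (255.0 - 127.0) / 127.0 * obj_size + obj_size   # Eq. (26)
    return cx, cy, size, size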

The whole network is trained offline from scratch on the VID dataset (object detection from video) of ILSVRC [50], and training takes about one day. Adam [51] optimization is used with mini-batches of 8 video clips of length 16. The initial learning rate is 1e-4 and is multiplied by 0.8 every 10k iterations. Each video clip is constructed by uniformly sampling frames (while keeping the temporal order) from a video. This aims to diversify the appearance variations in one episode for training, which can simulate fast motion, fast background change, jittering objects, and low frame rates. We use data augmentation, including small image stretching and translation, for the target image and search image. The dimension of the memory states in the LSTM controller is 512 and the retain probability used in dropout for the LSTM is 0.8. The numbers of positive and negative memory slots are N_{pos} = 8 and N_{neg} = 16. The distance threshold and score ratio threshold are τ = 4 and γ = 0.7, and the number of selected distractor templates is K = 2. The balancing factor for the auxiliary loss is κ = 0.05. The decay factor used for calculating the access vector is λ = 0.99.

At test time, the tracker runs completely feed-forward and no online fine-tuning is needed. We locate the target based on the upsampled response map as in SiamFC [13], and handle scale changes by searching for the target over three scales 1.05^{[-1,0,1]}. To smooth the scale estimation and penalize large displacements, we update the object scale with the new estimate using an exponential smoothing factor of 0.5, and dampen the response map with a cosine window using a smoothing factor of 0.19.
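The scale handling described above can be sketched as follows; the exact blending formulas are our assumptions, using the smoothing factors quoted in the text.

import numpy as np

def smooth_scale(prev_size, best_scale, alpha=0.5):
    # exponential smoothing of the object size toward the newly selected scale
    return (1.0 - alpha) * prev_size + alpha * prev_size * best_scale

def dampen_response(resp, window_influence=0.19):
    # blend the response map with a cosine (Hann) window to penalize large displacements
    hann = np.outer(np.hanning(resp.shape[0]), np.hanning(resp.shape[1]))
    return (1.0 - window_influence) * resp + window_influence * hann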

Our algorithm is implemented in Python with the TensorFlow toolbox [52], and tested on a computer with an Intel Core i7-7700 CPU @ 3.60 GHz (four cores) and a single NVIDIA GTX 1080 Ti with 11 GB RAM. It runs at about 50 fps for MemTrack and MemTrack*, and about 40 fps for MemDTC and MemDTC*.

5 EXPERIMENTS

We evaluate our preliminary tracker [22], which only has the positive memory network (MemTrack), as well as three improved versions: MemTrack with the auxiliary classification loss (MemTrack*), MemTrack with distractor template canceling (MemDTC), and MemDTC with the auxiliary classification loss (MemDTC*).


Fig. 7. Precision and success plots on OTB-2013 for real-time trackers.


Fig. 8. Precision and success plots on OTB-2015 for real-time trackers.

We conduct experiments on five challenging datasets: OTB-2013 [53], OTB-2015 [54], VOT-2015 [55], VOT-2016 [56] and VOT-2017 [57]. We follow the standard protocols, and evaluate using precision and success plots, as well as area-under-the-curve (AUC), on the OTB datasets. We also present the distance precision rate, overlap success rate and center location error on OTB for completeness. For the VOT datasets, we use the toolbox2 provided by the VOT committee to generate the results.

5.1 OTB datasets

On OTB-2013 and OTB-2015, we compare our proposed trackers with 12 recent real-time methods (≥ 15 fps): SiamRPN [58], DSiamM [14], PTAV [59], CFNet [17], LMCF [60], ACFN [61], RFL [19], SiamFC [13], SiamFC* [17], Staple [62], DSST [63], and KCF [64]. To further demonstrate our tracking accuracy, on OTB-2015 we also compare with another 8 recent state-of-the-art trackers that do not run at real-time speed: CREST [10], CSR-DCF [65], MCPF [66], SRDCFdecon [67], SINT [15], SRDCF [68], HDT [69], and HCF [49].

The OTB-2013 [53] dataset contains 51 sequences with 11 video attributes and two evaluation metrics, which are center location error and overlap ratio. The OTB-2015 [54] dataset is the extension of OTB-2013 to 100 sequences, and is thus more challenging. We conduct the ablation study mainly on OTB-2015 since it contains OTB-2013.

5.1.1 Comparison to real-time trackers

Figure 7 shows the one-pass evaluation comparison results with recent real-time trackers on OTB-2013. Our newly proposed trackers MemDTC, MemTrack* and MemDTC* achieve the three best AUC scores on the success plot, all outperforming our earlier work MemTrack [22].

2. https://github.com/votchallenge/vot-toolkit


Trackers (in column order): MemDTC*, MemDTC, MemTrack*, MemTrack, SiamRPN, DSiamM, PTAV, LMCF, ACFN, SiamFC, SiamFC*, RFL, Staple, CFNet, KCF, DSST

DP @10 (%) (↑)   I:  75.9 74.8 76.4 73.9 76.9 75.8 74.7 73.6 74.4 69.8 71.4 63.0 69.5 66.9 59.2 61.9
                 II: 72.5 71.6 72.7 69.5 71.3 65.7 70.7 67.4 68.1 63.9 67.1 63.5 67.6 66.0 55.7 58.1
DP @20 (%) (↑)   I:  88.4 86.6 87.1 84.9 88.4 89.1 87.9 84.2 86.0 80.9 80.6 78.6 79.3 78.5 74.0 74.0
                 II: 84.8 84.5 84.4 82.0 85.1 81.5 84.1 78.9 79.9 77.1 76.9 77.8 78.4 77.7 69.6 68.0
DP @30 (%) (↑)   I:  91.5 89.7 90.2 88.7 92.0 91.7 90.3 87.0 89.1 83.9 85.0 82.8 81.1 81.7 78.3 76.4
                 II: 88.1 87.8 87.9 85.6 88.8 86.8 88.0 81.8 84.4 81.1 80.9 81.7 80.6 81.3 74.0 71.3
OS @0.3 (%) (↑)  I:  90.5 90.2 89.7 88.9 92.0 92.0 89.3 86.0 86.5 83.9 84.9 82.9 80.0 81.0 73.0 74.5
                 II: 87.7 88.0 87.9 86.6 89.5 86.3 87.8 79.8 81.9 81.2 80.9 82.5 79.8 81.5 68.1 69.1
OS @0.5 (%) (↑)  I:  84.7 84.6 84.5 80.9 85.7 84.1 81.3 80.0 75.0 77.9 78.3 74.3 75.4 75.2 62.3 67.0
                 II: 80.6 80.3 80.8 78.3 81.9 76.0 76.8 71.9 69.2 73.0 73.6 73.0 70.9 73.7 55.1 60.1
OS @0.7 (%) (↑)  I:  59.0 60.0 60.4 57.3 56.8 56.1 59.9 56.5 49.0 55.1 55.0 48.0 57.9 51.3 39.3 51.8
                 II: 55.7 55.7 56.2 54.6 53.9 48.5 53.4 50.2 45.1 50.3 50.8 47.7 51.7 50.1 35.3 46.0
CLE (pixel) (↓)  I:  16.9 21.5 19.1 27.6 14.2 13.8 19.3 23.8 18.7 29.7 35.2 35.7 30.6 40.3 35.5 41.4
                 II: 20.3 21.8 22.1 27.8 19.2 22.8 19.8 39.0 25.2 33.1 35.9 35.8 31.4 34.8 44.7 50.3

TABLE 1
Comparison results on OTB-2013 (I) and OTB-2015 (II). DP @n is the distance precision rate at a threshold of n pixels and OS @s is the overlap success rate at an overlap ratio threshold of s. CLE is the center location error. The best result is bolded, and the second best is underlined. Up arrows indicate that higher values are better for that metric, while down arrows mean lower values are better.

For the precision plot with center location error, these three trackers also surpass MemTrack by a large margin. Compared with SiamFC [13], which is the baseline for matching-based methods without online updating, the proposed MemDTC*, MemDTC and MemTrack* achieve improvements of 9.3%, 7.0% and 7.7% on the precision plot, and 8.9%, 8.4% and 8.6% on the success plot. Our methods also outperform SiamFC*, the improved version of SiamFC [17] that uses linear interpolation of the old and new filters with a small learning rate for online updating. This indicates that our dynamic memory networks can handle object appearance changes better than simply interpolating new templates with old ones.

Figure 8 presents the precision and success plots for recent real-time trackers on OTB-2015. Our newly proposed trackers outperform all other methods on the success plot in terms of AUC score. Specifically, our methods perform much better than RFL [19], which uses the memory states of the LSTM to maintain the object appearance variations. This demonstrates the effectiveness of using an external addressable memory to manage object appearance changes, compared with using LSTM memory, which is limited by the size of the hidden states. Furthermore, the proposed MemDTC*, MemDTC and MemTrack* improve over the baseline SiamFC [13] by 10.0%, 9.6% and 9.5% on the precision plot, and 9.6%, 9.5% and 10.0% on the success plot. Our trackers MemTrack* and MemDTC* also outperform the three most recently proposed trackers, SiamRPN [58], DSiamM [14] and PTAV [59], in terms of AUC score.

Table 1 shows the quantitative results for the distance precision (DP) rate at different pixel thresholds (10, 20, 30), the overlap success (OS) rate at different overlap ratios (0.3, 0.5, 0.7), and the center location error (CLE) on both OTB-2013 (I) and OTB-2015 (II). Our improved trackers MemDTC*, MemDTC and MemTrack* consistently outperform our earlier work MemTrack [22] on all measures. In addition, they also perform well when the success condition is more strict (DP @10 and OS @0.7), indicating that their estimated bounding boxes are more accurate.

Figure 10 further shows the AUC scores of real-time trackers on OTB-2015 under different video attributes, including out-of-plane rotation, occlusion, motion blur, fast motion, in-plane rotation, out of view, background clutter and low resolution. Our MemTrack* outperforms other trackers on motion blur, fast motion and out of view, demonstrating its ability to adapt to appearance variations using multiple memory slots. In addition, our MemTrack also shows superior accuracy on the low-resolution attribute. Figure 11 shows qualitative results of our trackers compared with 6 real-time trackers.


Fig. 9. (left) Success plot on OTB-2015 comparing our real-time methods with recent non-real-time trackers. (right) AUC score vs. speed for recent trackers.


5.1.2 Comparison to non-real-time trackers

Figure 9 presents the comparison results with 8 recent state-of-the-art non-real-time trackers in terms of AUC score (left), and the AUC score vs. speed (right) of all trackers. Our newly proposed trackers MemDTC*, MemDTC and MemTrack*, which run in real-time (40 fps for MemDTC* and MemDTC, 50 fps for MemTrack*), outperform CREST [10], MCPF [66] and SRDCFdecon [67], which all run at ∼1 fps. Moreover, our earlier work MemTrack also surpasses SINT [15], another matching-based method that uses optical flow as motion information, in terms of both accuracy and speed.

5.2 VOT datasets

The VOT-2015 dataset [55] contains 60 video sequences with per-frame annotated visual attributes. Objects are marked with rotated bounding boxes to better fit their shapes. The VOT-2016 dataset [56] uses the same sequences as VOT-2015 but re-annotates the ground-truth bounding boxes automatically using per-frame segmentation masks. The VOT-2017 dataset [57] replaces the least challenging sequences in VOT-2016 with 10 new videos and manually fixes the bounding boxes that were incorrectly placed by the automatic methods introduced in VOT-2016.

Tracker performance is evaluated using three metrics: expected average overlap (EAO), accuracy, and robustness. The expected average overlap is computed by averaging the average overlap on a large set of sequence clips with different predefined lengths over all videos. The accuracy measures how well the bounding box estimated by the tracker fits the ground-truth box, and the robustness measures the frequency of tracking failures during tracking.



Fig. 10. The success plots on OTB-2015 for eight challenging attributes: out-of-plane rotation, occlusion, motion blur, fast motion, in-plane rotation, out of view, background clutter and low resolution.

Fig. 11. Qualitative results of our MemTrack, along with SiamFC [13], RFL [19], CFNet [17], Staple [62], LMCF [60], ACFN [61], on eight challenging sequences. From left to right, top to bottom: board, bolt2, dragonbaby, lemming, matrix, skiing, biker, girl2.

5.2.1 VOT-2015 Results

There are 41 trackers submitted to VOT-2015 and 21 baseline trackers contributed by the VOT-2015 committee. Table 2 presents the detailed comparison results with the 25 best-performing methods (according to EAO, expected average overlap). Our newly proposed tracker MemDTC* achieves the third and fourth place in terms of accuracy and EAO, respectively. Note that MDNet [11] achieves the best score on all metrics, but is much slower than our MemTrack*. Our methods also run much faster (higher EFO3) than DeepSRDCF [70] and EBT [71], which are slightly better than our MemTrack*.

3. EFO (equivalent filter operations) is a measure of speed generated automatically by the VOT toolkit, and is similar to fps but is a relative value. For example, SiamFC gets 32 EFO (VOT) vs. 86 fps (original) in Table 4, while ours is 24 EFO (VOT) vs. 50 fps (original).

Moreover, SRDCF [68], LDP [55] and sPST [72], which outperform our MemTrack, do not run at real-time speed. Figure 12 shows the accuracy-robustness ranking plot (left) and the EAO vs. EFO plot (right) on the VOT-2015 dataset. Our methods perform favorably against state-of-the-art trackers in terms of both accuracy and robustness (see the upper right corner), while maintaining real-time speed (∼20 EFO). Finally, MemTrack and MemTrack* have slightly worse EAO than the baseline SiamFC on VOT-2015, which could be an artifact of the noisy ground-truth bounding boxes in VOT-2015. On VOT-2016, which contains the same videos as VOT-2015 but has improved ground-truth annotations, MemTrack and MemTrack* outperform SiamFC by a large margin (see Section 5.2.2).



Fig. 12. The AR rank plots and EAO vs. EFO for VOT-2015. Our methods are colored red, while the top-10 methods are marked in blue. Others are colored gray.

Tracker       EAO (↑)  Acc. (↑)  Rob. (↓)  EFO (↑)
MDNet         0.3783   0.6033    0.6936    0.97
DeepSRDCF     0.3181   0.5637    1.0457    0.26
EBT           0.3130   0.4732    1.0213    2.74
MemDTC*       0.3005   0.5646    1.4610    22.18
MemDTC        0.2948   0.5509    1.6365    21.95
SiamFC^4      0.2889   0.5335    -         -
SRDCF         0.2877   0.5592    1.2417    1.36
MemTrack*     0.2842   0.5573    1.6768    26.24
LDP           0.2785   0.4890    1.3332    5.17
sPST          0.2767   0.5473    1.4796    1.16
MemTrack      0.2753   0.5582    1.7286    26.11
SC-EBT        0.2548   0.5529    1.8587    1.83
NSAMF         0.2536   0.5305    1.2921    6.81
Struck        0.2458   0.4712    1.6097    3.52
RAJSSC        0.2420   0.5659    1.6296    2.67
S3Tracker     0.2403   0.5153    1.7680    20.04
SumShift      0.2341   0.5169    1.6815    23.55
SODLT         0.2329   0.5607    1.7769    1.14
DAT           0.2238   0.4856    2.2583    14.87
MEEM          0.2212   0.4993    1.8535    3.66
RobStruck     0.2198   0.4793    1.4724    1.67
OACF          0.2190   0.5751    1.8128    9.88
MCT           0.2188   0.4703    1.7609    3.98
HMMTxD        0.2185   0.5278    2.4835    2.17
ASMS          0.2117   0.5066    1.8464    142.26
MKCF+         0.2095   0.5153    1.8318    1.79
TRIC-track    0.2088   0.4618    2.3426    0.03
AOG           0.2080   0.5067    1.6727    1.26
SME           0.2068   0.5528    1.9763    5.77
MvCFT         0.2059   0.5220    1.7220    11.85

TABLE 2
Results on VOT-2015. The evaluation metrics include expected average overlap (EAO), accuracy value (Acc.), robustness value (Rob.) and equivalent filter operations (EFO). The top three performing trackers are colored red, green and blue, respectively. Up arrows indicate that higher values are better for that metric, while down arrows mean lower values are better. Real-time methods (>15 EFO) are underlined.

5.2.2 VOT-2016 Results

In total, 48 tracking methods are submitted to the VOT-2016 challenge and 22 baseline algorithms are provided by the VOT-2016 committee and associates. Table 3 summarizes the detailed comparison results with the top-25 performing trackers. Overall, CCOT [73] achieves the best results on EAO, SSAT [56] obtains the best performance on accuracy, and TCNN [18] outperforms all other trackers on robustness.

4. Results are obtained from the original SiamFC paper [13].


Fig. 13. The AR rank plots and EAO vs. EFO for VOT-2016. See the caption of Figure 12 for a description of the colors.

Our MemDTC* ranks 5th on EAO, and surpasses the other variants MemTrack, MemTrack* and MemDTC by a large margin. Using deeper networks (e.g., VGG), SSAT and MLDF achieve better EAO than MemDTC*, which however comes at the cost of considerable computation, leading to non-real-time speed. It is worth noting that the proposed MemDTC* runs much faster than the trackers that rank ahead of it. Figure 13 shows the accuracy-robustness ranking plot (left) and the EAO vs. EFO plot (right) on the VOT-2016 dataset. Our algorithm MemDTC* achieves a better robustness rank than the other three variants MemDTC, MemTrack* and MemTrack. As reported in VOT-2016, the state-of-the-art bound is EAO 0.251, which all our trackers exceed. In addition, our trackers outperform the baseline matching-based trackers SiamAN (SiamFC with AlexNet), SiamRN (SiamFC with ResNet) and RFL [19] on all evaluation metrics.

5.2.3 VOT-2017 Results

There are 38 valid entries submitted to the VOT-2017 challenge and 13 baseline trackers contributed by the VOT-2017 committee and associates. Table 4 shows the comparison results of the top-25 performing trackers, as well as our proposed methods. The winner of VOT-2017 is LSART [74], which utilizes a weighted cross-patch similarity kernel for kernelized ridge regression and a fully convolutional neural network with spatially regularized kernels. However, due to heavy model fusion and the use of deeper networks (VGGNet), it runs at ∼2 fps. ECO [75], which ranks fourth on EAO, improves CCOT [73] in both performance and speed by introducing a factorized convolution operator. However, the speed of ECO is still far from real-time even though it is faster than CCOT. CFWCR [57] adopts ECO as the baseline and further boosts it by using more layers for feature fusion. CFCF [76] also utilizes ECO as the baseline tracker and improves it by selecting different layer features (first, fifth and sixth) of a newly trained fully convolutional network on ILSVRC [50] as the input of ECO. In addition, Gnet [57] integrates GoogLeNet features into SRDCF [68] and ECO [75]. Since the trackers that rank ahead of our MemDTC* are usually based on either ECO or SRDCF with deeper networks (like VGG), and are thus not real-time, our tracker performs favorably against these top performing trackers while retaining real-time speed. Furthermore, our MemDTC* outperforms both our earlier work MemTrack [22] and the baseline tracker SiamFC [13].


Tracker       EAO (↑)  Acc. (↑)  Rob. (↓)  EFO (↑)
CCOT          0.3305   0.5364    0.8949    0.51
TCNN          0.3242   0.5530    0.8302    1.35
SSAT          0.3201   0.5764    1.0462    0.80
MLDF          0.3093   0.4879    0.9236    2.20
MemDTC*       0.2976   0.5297    1.3106    22.30
Staple        0.2940   0.5406    1.4158    14.43
DDC           0.2928   0.5363    1.2656    0.16
EBT           0.2909   0.4616    1.0455    2.87
SRBT          0.2897   0.4949    1.3314    2.90
STAPLE+       0.2849   0.5537    1.3094    18.12
DNT           0.2781   0.5136    1.2004    1.88
SiamRN        0.2760   0.5464    1.3617    7.05
DeepSRDCF     0.2759   0.5229    1.2254    0.38
SSKCF         0.2747   0.5445    1.4299    44.06
MemTrack      0.2723   0.5273    1.4381    24.60
MemTrack*     0.2713   0.5378    1.4736    24.22
MemDTC        0.2679   0.5109    1.8287    21.42
SHCT          0.2654   0.5431    1.3902    0.54
MDNet_N       0.2569   0.5396    0.9123    0.69
FCF           0.2508   0.5486    1.8460    2.39
SRDCF         0.2467   0.5309    1.4332    1.99
RFD_CF2       0.2414   0.4728    1.2697    1.20
GGTv2         0.2373   0.5150    1.7334    0.52
SiamAN        0.2345   0.5260    1.9093    11.93
DPT           0.2343   0.4895    1.8509    4.03
deepMKCF      0.2320   0.5405    1.2271    1.89
HMMTxD        0.2311   0.5131    2.1444    4.99
NSAMF         0.2267   0.4984    1.2536    6.61
ColorKCF      0.2257   0.5003    1.5009    111.39

TABLE 3
Results on VOT-2016. See the caption of Table 2 for more information.


Fig. 14. The AR rank plots and EAO vs. EFO for VOT-2017. Our methods are colored red while the top-10 methods are marked in blue. Others are colored gray.

Figure 14 shows the accuracy-robustness ranking plot (left) and the EAO vs. EFO plot (right) on the VOT-2017 dataset. Our methods perform favorably against state-of-the-art trackers in terms of both accuracy and robustness, and achieve the best performance among real-time trackers.

5.3 Ablation Studies

Our preliminary tracker MemTrack [22] contains three important components: 1) an attention mechanism, which calculates the attended feature vector for memory reading; 2) a dynamic memory network, which maintains the target's appearance variations; and 3) residual template learning, which controls the amount of model updating for each channel of the template.

Tracker     EAO (↑)  Acc. (↑)  Rob. (↓)  EFO (↑)
LSART       0.3211   0.4913    0.9432    1.72
CFWCR       0.2997   0.4818    1.2103    1.80
CFCF        0.2845   0.5042    1.1686    0.85
ECO         0.2803   0.4806    1.1167    3.71
Gnet        0.2723   0.4992    0.9973    1.29
MCCT        0.2679   0.5198    1.1258    1.32
CCOT        0.2658   0.4887    1.3153    0.15
MemDTC*     0.2651   0.4909    1.5287    21.12
CSRDCF      0.2541   0.4835    1.3095    8.75
MemDTC      0.2504   0.4924    1.7730    20.49
SiamDCF     0.2487   0.4956    1.8659    10.73
MCPF        0.2477   0.5081    1.5903    0.42
CRT         0.2430   0.4613    1.2367    3.24
MemTrack    0.2427   0.4935    1.7735    24.27
MemTrack*   0.2416   0.5025    1.8058    24.74
ECOhc       0.2376   0.4905    1.7737    17.71
DLST        0.2329   0.5051    1.5667    1.89
DACF        0.2278   0.4498    1.3211    21.96
CSRDCFf     0.2257   0.4712    1.3905    15.05
RCPF        0.2144   0.5001    1.5892    0.42
UCT         0.2049   0.4839    1.8307    12.20
SPCT        0.2025   0.4682    2.1547    4.40
ATLAS       0.1953   0.4821    2.5702    5.21
MEEM        0.1914   0.4548    2.1111    4.12
FSTC        0.1878   0.4730    1.9235    0.96
SiamFC      0.1876   0.4945    2.0485    31.89
SAPKLTF     0.1835   0.4764    2.2002    31.65
ASMS        0.1687   0.4868    2.2496    130.02
Staple      0.1685   0.5194    2.5068    47.01

TABLE 4
Results on VOT-2017. See the caption of Table 2 for more information.

[Figure 15 legends. Left (variant, AUC): MemTrack* 0.640, MemDTC* 0.638, MemDTC 0.637, MemTrack 0.626, MemTrack-NoAtt 0.611, MemTrack-NoRes 0.603, MemTrack-HardRead 0.600, MemTrack-Queue 0.581. Right (memory size, AUC): MemDTC-M16 0.637, MemDTC-M32 0.636, MemDTC-M4 0.631, MemTrack-M8 0.626, MemTrack-M16 0.625, MemDTC-M8 0.623, MemDTC-M2 0.622, MemDTC-M1 0.617, MemTrack-M4 0.607, MemTrack-M2 0.590, MemTrack-M1 0.586.]

Fig. 15. Ablation studies: (left) success plots of different variants of our tracker on OTB-2015; (right) success plots for different memory sizes, {1, 2, 4, 8, 16} positive memory slots and {1, 2, 4, 8, 16, 32} negative memory slots, on OTB-2015.

To evaluate their separate contributions to our tracker, we implement several variants of our method and evaluate them on the OTB-2015 dataset. The results of the ablation study are presented in Figure 15 (left).

We first design a variant of MemTrack without the attention mechanism (MemTrack-NoAtt), which averages all L feature vectors to get the feature vector a_t for the LSTM input. Mathematically, it changes (2) to a_t = \frac{1}{L} \sum_{i=1}^{L} f^*_{t,i}. MemTrack-NoAtt decreases performance (see Figure 15, left), which shows the benefit of using attention to roughly localize the target in the search image. We also design a naive strategy that simply writes the new target template sequentially into the memory slots as a queue (MemTrack-Queue). When the memory is fully occupied, the oldest template is replaced with the new template.


The retrieved template is generated by averaging all templates stored in the memory slots. This simple approach cannot produce good performance (Figure 15, left), which shows the necessity of our dynamic memory network. We devise a hard template reading scheme (MemTrack-HardRead), i.e., retrieving a single template by maximum cosine similarity, to replace the soft weighted-sum reading scheme. This design decreases the performance (Figure 15, left), most likely because the non-differentiable reading leads to an inferior model. To verify the effectiveness of gated residual template learning, we design another variant of MemTrack that removes the channel-wise residual gates (MemTrack-NoRes), i.e., directly adding the retrieved and initial templates to get the final template. Our gated residual template learning mechanism boosts the performance (Figure 15, left) as it helps to select the correct residual channel features for template updating.

In this paper, we improve MemTrack with two techniques: distractor template canceling and an auxiliary classification loss. Our newly proposed methods MemTrack*, MemDTC and MemDTC* consistently outperform our earlier work MemTrack (Figure 15, left). Without the auxiliary classification loss, MemDTC outperforms MemTrack on both the precision and success plots of OTB-2013/2015, which demonstrates the effectiveness of the distractor template canceling strategy. When using the auxiliary classification loss, MemDTC* has slightly better performance than MemTrack* on OTB-2013, but slightly worse performance on OTB-2015. It is possible that the discrimination ability of the feature extractor (only a 5-layer CNN) limits the performance gain of the auxiliary loss. We also note that MemTrack* achieves slightly worse EAO than MemTrack on VOT-2016/2017, while MemDTC* is better than MemDTC on VOT-2015/2016/2017. On these datasets, the auxiliary task cannot further improve the performance unless the distractor template canceling scheme is used.

We also investigate the effect of memory size on tracking performance. Figure 15 (right) shows the success plot on OTB-2015 using different numbers of memory slots. For MemTrack, tracking accuracy increases along with the memory size and saturates at 8 memory slots. Considering the runtime and memory usage, we choose 8 as the default number of positive memory slots. For our improved tracker MemDTC, we keep the number of positive memory slots fixed at 8 and change the number of negative memory slots. The tracking performance increases with the number of negative memory slots, and saturates at 16.

6 CONCLUSION

In this paper, we propose a dynamic memory network with an external addressable memory block for visual tracking, aiming to adapt matching templates to object appearance variations. An LSTM with an attention scheme controls the memory access by parameterizing the memory interactions. We develop channel-wise gated residual template learning to form the positive matching model, which preserves the conservative information present in the initial target, while providing online adaptability of each feature channel. To alleviate the drift problem caused by distractor targets, we devise a distractor template canceling scheme that inhibits channels in the final template that are not discriminative. Furthermore, we improve the tracking performance by introducing an auxiliary classification loss branch after the

feature extractor, aiming to learn semantic features that complement the features trained by similarity matching. Once the offline training process is finished, no online fine-tuning is needed, which leads to real-time speed. Extensive experiments on standard tracking benchmarks demonstrate the effectiveness of our proposed trackers.

ACKNOWLEDGMENTS

This work was supported by grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (CityU 11200314 and CityU 11212518). We are grateful for the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.
[2] C. Szegedy, W. Liu, Y. Jia, and P. Sermanet, “Going deeper with convolutions,” in CVPR, 2015.
[3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in CVPR, 2016.
[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in CVPR, 2014.
[5] R. Girshick, “Fast R-CNN,” in ICCV, 2015.
[6] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” in NIPS, 2015.
[7] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015.
[8] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in ICCV, 2016.
[9] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei, “Fully Convolutional Instance-aware Semantic Segmentation,” in CVPR, 2017.
[10] Y. Song, C. Ma, L. Gong, J. Zhang, R. Lau, and M.-H. Yang, “CREST: Convolutional Residual Learning for Visual Tracking,” in ICCV, 2017.
[11] H. Nam and B. Han, “Learning Multi-Domain Convolutional Neural Networks for Visual Tracking,” in CVPR, 2016.
[12] L. Wang, W. Ouyang, X. Wang, and H. Lu, “Visual Tracking with Fully Convolutional Networks,” in ICCV, 2015.
[13] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr, “Fully-Convolutional Siamese Networks for Object Tracking,” in ECCVW, 2016.
[14] Q. Guo, W. Feng, C. Zhou, R. Huang, L. Wan, and S. Wang, “Learning Dynamic Siamese Network for Visual Object Tracking,” in ICCV, 2017.
[15] R. Tao, E. Gavves, and A. W. M. Smeulders, “Siamese Instance Search for Tracking,” in CVPR, 2016.
[16] D. Held, S. Thrun, and S. Savarese, “Learning to Track at 100 FPS with Deep Regression Networks,” in ECCV, 2016.
[17] J. Valmadre, L. Bertinetto, F. Henriques, A. Vedaldi, and P. H. S. Torr, “End-to-end representation learning for Correlation Filter based tracking,” in CVPR, 2017.
[18] H. Nam, M. Baek, and B. Han, “Modeling and Propagating CNNs in a Tree Structure for Visual Tracking,” arXiv, 2016.
[19] T. Yang and A. B. Chan, “Recurrent Filter Learning for Visual Tracking,” in ICCVW, 2017.
[20] A. He, C. Luo, X. Tian, and W. Zeng, “A Twofold Siamese Network for Real-Time Object Tracking,” in CVPR, 2018.
[21] Q. Wang, Z. Teng, J. Xing, J. Gao, W. Hu, and S. Maybank, “Learning Attentions: Residual Attentional Siamese Network for High Performance Online Visual Tracking,” in CVPR, 2018.
[22] T. Yang and A. B. Chan, “Learning Dynamic Memory Networks for Object Tracking,” in ECCV, 2018.
[23] H. Grabner, C. Leistner, and H. Bischof, “Semi-Supervised On-line Boosting for Robust Tracking,” in ECCV, 2008.
[24] B. Babenko, M.-H. Yang, and S. Belongie, “Robust Object Tracking with Online Multiple Instance Learning,” TPAMI, 2011.
[25] Z. Kalal, K. Mikolajczyk, and J. Matas, “Tracking-Learning-Detection,” TPAMI, 2012.
[26] P. Li, D. Wang, L. Wang, and H. Lu, “Deep visual tracking: Review and experimental comparison,” Pattern Recognition, 2018.


[27] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” JMLR, 2014.
[28] B. Han, J. Sim, and H. Adam, “BranchOut: Regularization for Online Ensemble Tracking with Convolutional Neural Networks,” in CVPR, 2017.
[29] C. Huang, S. Lucey, and D. Ramanan, “Learning Policies for Adaptive Tracking with Deep Feature Cascades,” in ICCV, 2017.
[30] Z. Chi, H. Li, H. Lu, and M.-H. Yang, “Dual deep network for visual tracking,” TIP, 2017.
[31] A. Graves, G. Wayne, and I. Danihelka, “Neural Turing Machines,” arXiv, 2014.
[32] J. Weston, S. Chopra, and A. Bordes, “Memory Networks,” in ICLR, 2015.
[33] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus, “End-To-End Memory Networks,” in NIPS, 2015.
[34] A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwinska, S. Gomez Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, A. P. Badia, K. Moritz Hermann, Y. Zwols, G. Ostrovski, A. Cain, H. King, C. Summerfield, P. Blunsom, K. Kavukcuoglu, and D. Hassabis, “Hybrid computing using a neural network with dynamic external memory,” Nature, 2016.
[35] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap, “One-shot Learning with Memory-Augmented Neural Networks,” in ICML, 2016.
[36] B. Liu, Y. Wang, Y.-W. Tai, and C.-K. Tang, “MAVOT: Memory-Augmented Video Object Tracking,” arXiv, 2017.
[37] R. Collobert and J. Weston, “A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning,” in ICML, 2008.
[38] L. Deng, G. Hinton, and B. Kingsbury, “New Types of Deep Neural Network Learning for Speech Recognition and Related Applications: An Overview,” in ICASSP, 2013.
[39] R. Girshick, “Fast R-CNN,” in ICCV, 2015.
[40] R. Caruana, “Multitask Learning,” Machine Learning, 1997.
[41] Y. Zhang and D.-Y. Yeung, “A Convex Formulation for Learning Task Relationships in Multi-task Learning,” arXiv, 2012.
[42] S. Li, Z.-Q. Liu, and A. B. Chan, “Heterogeneous Multi-task Learning for Human Pose Estimation with Deep Convolutional Neural Network,” IJCV, 2015.
[43] J. Yao, S. Fidler, and R. Urtasun, “Describing the Scene as A Whole: Joint Object Detection, Scene Classification and Semantic Segmentation,” in CVPR, 2012.
[44] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask R-CNN,” in ICCV, 2017.
[45] D. Eigen and R. Fergus, “Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture,” in ICCV, 2015.
[46] A. Kendall, Y. Gal, and R. Cipolla, “Multi-task Learning using Uncertainty to Weigh Losses for Scene Geometry and Semantics,” in CVPR, 2018.
[47] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer Normalization,” arXiv, 2016.
[48] M. D. Zeiler and R. Fergus, “Visualizing and Understanding Convolutional Networks,” in ECCV, 2014.
[49] C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang, “Hierarchical Convolutional Features for Visual Tracking,” in ICCV, 2015.
[50] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” IJCV, 2015.
[51] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv, 2014.
[52] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv, 2016.
[53] Y. Wu, J. Lim, and M.-H. Yang, “Online Object Tracking: A Benchmark,” in CVPR, 2013.
[54] ——, “Object Tracking Benchmark,” PAMI, 2015.
[55] M. Kristan, R. Pflugfelder, A. Leonardis, J. Matas, L. Cehovin, G. Nebehay, T. Vojíř, G. Fernandez, and others, “The visual object tracking VOT2015 challenge results,” in ICCVW, 2015.
[56] M. Kristan, A. Leonardis, J. Matas, and M. Felsberg, “The Visual Object Tracking VOT2016 Challenge Results,” in ECCVW, 2016.
[57] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. C. Zajc, T. Vojíř, G. Hager, A. Lukezic, A. Eldesokey, G. Fernandez, and others, “The Visual Object Tracking VOT2017 Challenge Results,” in ICCVW, 2017.
[58] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu, “High Performance Visual Tracking with Siamese Region Proposal Network,” in CVPR, 2018.
[59] H. Fan and H. Ling, “Parallel Tracking and Verifying: A Framework for Real-Time and High Accuracy Visual Tracking,” in ICCV, 2017.
[60] M. Wang, Y. Liu, and Z. Huang, “Large Margin Object Tracking with Circulant Feature Maps,” in CVPR, 2017.
[61] J. Choi, H. J. Chang, S. Yun, T. Fischer, Y. Demiris, and J. Y. Choi, “Attentional Correlation Filter Network for Adaptive Visual Tracking,” in CVPR, 2017.
[62] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. Torr, “Staple: Complementary Learners for Real-Time Tracking,” in CVPR, 2016.
[63] M. Danelljan, G. Hager, F. Khan, and M. Felsberg, “Accurate Scale Estimation for Robust Visual Tracking,” in BMVC, 2014.
[64] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed tracking with kernelized correlation filters,” TPAMI, 2015.
[65] A. Lukezic, T. Vojíř, L. Cehovin, J. Matas, and M. Kristan, “Discriminative Correlation Filter with Channel and Spatial Reliability,” in CVPR, 2017.
[66] T. Zhang, C. Xu, and M.-H. Yang, “Multi-task Correlation Particle Filter for Robust Object Tracking,” in CVPR, 2017.
[67] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg, “Adaptive Decontamination of the Training Set: A Unified Formulation for Discriminative Visual Tracking,” in CVPR, 2016.
[68] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg, “Learning Spatially Regularized Correlation Filters for Visual Tracking,” in ICCV, 2015.
[69] Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim, and M.-H. Yang, “Hedged Deep Tracking,” in CVPR, 2016.
[70] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg, “Convolutional Features for Correlation Filter Based Visual Tracking,” in ICCVW, 2016.
[71] G. Zhu, F. Porikli, and H. Li, “Beyond Local Search: Tracking Objects Everywhere with Instance-Specific Proposals,” in CVPR, 2016.
[72] Y. Hua, K. Alahari, and C. Schmid, “Online Object Tracking with Proposal Selection,” in ICCV, 2015.
[73] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg, “Beyond correlation filters: Learning continuous convolution operators for visual tracking,” in ECCV, 2016.
[74] C. Sun, H. Lu, and M.-H. Yang, “Learning Spatial-Aware Regressions for Visual Tracking,” in CVPR, 2018.
[75] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg, “ECO: Efficient Convolution Operators for Tracking,” in CVPR, 2017.
[76] E. Gundogdu and A. A. Alatan, “Good Features to Correlate for Visual Tracking,” TIP, 2018.

Tianyu Yang received the B.S. degree from Liaocheng University, China, and the M.Eng. degree from the University of Chinese Academy of Sciences, China, in 2010 and 2013, respectively. He is currently a PhD student at City University of Hong Kong, China. His current research interests include visual tracking and deep learning.

Antoni B. Chan received the B.S. and M.Eng. degrees in electrical engineering from Cornell University, Ithaca, NY, in 2000 and 2001, and the Ph.D. degree in electrical and computer engineering from the University of California, San Diego (UCSD), San Diego, in 2008. He is currently an Associate Professor in the Department of Computer Science, City University of Hong Kong. His research interests include computer vision, machine learning, pattern recognition, and music analysis.