

Neurocomputing 275 (2018) 2645–2655


Visual tracking using Siamese convolutional neural network with region proposal and domain specific updating

Han Zhang∗, Weiping Ni, Weidong Yan, Junzheng Wu, Hui Bian, Deliang Xiang

Northwest Institute of Nuclear Technology, Baqiao, Xi'an, 710024, China

∗ Corresponding author. E-mail address: [email protected] (H. Zhang).

ARTICLE INFO

Article history:

Received 22 March 2017

Revised 5 November 2017

Accepted 17 November 2017

Available online 29 November 2017

Communicated by Zhu Jianke

Keywords:

Visual tracking

Convolutional neural network

Siamese network

Region proposal

ABSTRACT

This paper deals with the problem of arbitrary object tracking using a Siamese convolutional neural network (CNN), which is trained to match the initial patch of the target in the first frame with candidates in a new frame. The network returns the most similar candidate, i.e., the one with the smallest margin contrastive loss. For candidate proposals in each frame, a Siamese region proposal network is applied to identify potential targets from across the whole frame. It is also able to mine hard negative examples to make the network more discriminative for the specific sequence. The Siamese tracking network and the Siamese region proposal network share weights, which are trained end-to-end. Taking advantage of the fast implementation of the fully convolutional architecture, the Siamese region proposal network adds little extra time during online tracking. Although the network is trained to be a generic tracker that can be applied to any video sequence, we find that domain specific network updating with a short- and long-term strategy can significantly improve the tracking performance. After combining generic Siamese network training, Siamese region proposal, and domain specific updating, the proposed tracker obtains state-of-the-art tracking performance.

© 2017 Elsevier B.V. All rights reserved.

1. Introduction

Taking advantage of deep convolutional neural networks (CNNs) as well as huge training datasets, computer vision techniques have taken a great step forward [1]. Especially in the fields of image classification [2] and target detection [3,4], computers have reached human-level performance. However, generic visual object tracking is still challenging, even with the help of deep CNNs and large training datasets. This is caused by the fact that the object is unknown before tracking and shows arbitrary size and appearance over the whole tracking interval. Most existing trackers, with either a generative model or a discriminative model, such as SCM [5], MILTrack [6], Struck [7], TLD [8] and KCF [9], are based on an object-specific approach. The model of the object's appearance is learned in an online fashion using examples extracted from the video itself. However, the labels of these online training examples are assigned by the online tracking results, hence they are not guaranteed to be the ground truth. The reliability of those labels heavily depends on the validity of the tracking model.

Although trackers using deep CNNs are not yet fully satisfactory, they still outperform traditional trackers with shallow architectures. To fully exploit the representation power of CNNs in visual tracking, several deep architectures have been developed. It is tricky to train a proper CNN in an object-specific fashion. Approaches have been proposed that transfer CNN weights pre-trained on a large-scale image dataset such as ImageNet [10–12]. Since there is only one reference image (the ground truth) for tracking, it makes sense to compare the candidate image patches with the reference image patch and then choose the most similar one. A Siamese CNN satisfies this demand exactly. However, it is difficult for a Siamese network to learn features that are representative enough to adapt to the various appearance changes of the same target, such as changes in geometry/photometry, camera viewpoint, illumination, or partial occlusion. There is still a long way to go to build a reliable generic visual tracker for arbitrary object tracking.

In this paper, we propose a Siamese CNN tracker based on the VGG-M network [13]. Similar CNN architectures have also been employed by the trackers in [12,14], which represent the best performance in either accuracy or efficiency to date. Our proposed network is initialized with the network weights pre-trained on the ImageNet classification dataset, and further fine-tuned using three video datasets [15–17]. The entire network is trained end-to-end by SGD with the back-propagation algorithm using the common library Caffe [18].


The Siamese CNN is not only used for tracking, but is also adopted for candidate object proposals by embedding an additional convolutional layer. The proposed Siamese CNN tracker can be used as a generic tracker that can be directly applied to any video sequence. Inspired by the outstanding performance of MDNet [12], we find that domain specific updating can improve the tracking performance significantly. Experiment results show that our proposed network gives the best performance among the proposed Siamese networks [10,11,19].

The key contribution of our work lies in two aspects. First, a Siamese CNN tracker is presented and trained using a combined dataset containing 1719 sequences. Without any model updating, the tracker is able to obtain performance comparable with MEEM [20]. We also try to train our Siamese net with a small dataset containing only 58 sequences. A comparative experiment on the OTB-100 dataset shows that training data of large size and good variety are important to obtain a good generic CNN tracker. However, the performance of the CNN based tracker is still not as good as expected. We assume this is because the deep feature is not representative enough for all object variations such as deformation, occlusion, etc. It means that domain specific model updating is essential to further improve the performance of any tracker. By fine-tuning the last three layers of our network during the online tracking procedure, significant performance improvement is observed. During online updating, a short- and long-term memory strategy is adopted. The second contribution is that a Siamese region proposal network is constructed based on the proposed Siamese CNN tracker. The region proposal network identifies potential object candidates across the whole incoming frame instead of a small search radius around the previous target location. The improvement brought by the region proposal network mainly lies in three aspects. The first is to reduce the number of candidate image patches in each frame, which helps to improve the efficiency of the tracking procedure. The second is to re-detect the tracked object after we lose it or after it has been blocked for a while. Third, the region proposal procedure helps to mine hard negative examples that are used for model updating. During the tracking process, we update the object model concentrating on hard false-positives that are supplied by the region proposal network. Hard false-positive samples help to suppress distractors caused by complex background clutters, and to learn how to re-rank proposals according to the object model. The proposed Siamese region proposal network is designed by taking advantage of the fast implementation of the fully convolutional network proposed in [14], where an additional correlation layer is appended at the end. By sharing parameters between the Siamese tracker network and the Siamese region proposal network, only a little extra time is needed for candidate target region proposal in a new frame. The Siamese CNN tracker and the Siamese region proposal network are combined in a way similar to the combination of the target objectness region proposal network and the target detection network in the faster-RCNN target detector [21].

The paper is organized as follows. Section 2 discusses related works in tracking and convolutional neural networks. Our tracking framework is described in Section 3. Section 4 presents the experimental evaluations and results. Finally, conclusions are given in Section 5.

2. Related work

2.1. Siamese tracker

Object representation is one of the major components in any visual tracking algorithm. Wang [22] concludes that the feature extractor is the most important part of a tracker and that the observation model is not significantly important if the features are good enough. Fortunately, deep CNN is a powerful tool to learn good visual features. Given the initial state (e.g., position and extent) of a target object in the first image, the goal of tracking is to estimate the states of the target in the subsequent frames, which means to find the new position and extent of the target in each new frame. It is straightforward to build a Siamese CNN architecture to find the potential target state that best resembles the initial state. Siamese CNNs have been successfully applied to face and pedestrian verification. Several recent works have also suggested using Siamese CNNs in the context of tracking [10,11,19].

A Siamese tracking architecture is designed in [10], which is based on the deep VGGNet incorporating ROI pooling features from multiple convolution layers. The network is trained using the ALOV dataset [17], from which 60,000 pairs of frames are extracted, with each pair of frames containing 128 pairs of boxes. However, the proposed tracker only performs slightly better than MEEM [20] and MUSTer [23], both of which are shallow feature based approaches. Due to the large number of parameters, the tracker takes about a second to process one frame on a GPU, even without any model updating during online tracking.

A novel fully-convolutional Siamese network is proposed in [14], which is trained end-to-end on the ILSVRC15 video object detection dataset. The training dataset contains more than 4000 sequences. Taking advantage of the fast implementation of the fully-convolutional architecture, this approach achieves competitive performance on modern tracking benchmarks. The fully-convolutional Siamese network is able to run at speeds that exceed the real-time requirement. The fast implementation also relies on the fact that no model update is incorporated during tracking.

2.2. Region proposal

Most current visual trackers search for the potential target position within a search radius centered on the previously predicted object location, with the radius no more than two times the previous target extent [7]. One important reason that existing trackers avoid employing a larger search radius is the potential distraction from the background. It is not a trivial task to update a discriminative classifier when the negative sample space grows greatly with samples coming from the extended search radius. Besides, the radius search approach requires dense region proposals with great redundancy, which slows down the tracking procedure. It may also not always be valid, in particular for fast and irregularly moving objects.

Proper region proposals are able to improve the performance of any tracker significantly. In [10], the performance of the proposed SINT tracker is improved by a considerable margin with adaptive sampling and a simple use of optical flow [24]. In [25], a spatial-temporal saliency guided sampling approach is adopted to guide more accurate target localization through qualified sampling within an inter-frame motion flow map. In [26], an object tracker termed EBT is presented that is not limited to a local search window and has the ability to probe the entire frame efficiently. The EBT tracker searches for instance-specific candidate target proposals across the whole frame by training an SVM classifier based on the Edge Box feature [27]. Experiment results show that the edge box based proposals outperform traditional target candidate proposal approaches such as nearby window search or particle search.

2.3. Online updating

Online updating is essential, no matter whether for a shallow feature based tracker or a deep feature based one. However, it is time consuming and difficult to control. The Discriminative Correlation Filter (DCF) based trackers apply a more tractable online optimization problem by changing to the Fourier basis [28].


Benefiting from the fast implementation of the FFT, the DCF based online updating shows excellent time efficiency. The authors of MEEM [20] propose a multi-expert restoration scheme to address the model drift problem in online tracking. In MEEM, a tracker and its historical snapshots constitute an expert ensemble, where the best expert is selected to restore the current tracker. The deep feature based TCNN tracker [29] imitates MEEM by managing multiple target appearance models in a tree structure. The MUlti-Store Tracker (MUSTer) [23] is a dual-component approach consisting of short- and long-term memory stores to process target appearance memories. The short- and long-term memory strategy enables the MUSTer tracker to outperform the MEEM tracker by about 10%. Due to its excellent performance as well as easy implementation, it is one of the most popular methods adopted by deep network based trackers (such as MDNet and SANet [30]) for online model update. The multi-domain network (MDNet) in [12] presents the best tracking performance compared with other state-of-the-art techniques. We suppose that the excellent performance of MDNet owes to three reasons. First, common vision information and domain specific information are separated. Second, the online updating scheme of MDNet consists of both a short-term and a long-term updater. Third, hard negative examples mined around the previously predicted target locations help to improve the online updating performance.

Fig. 1. The proposed Siamese CNN network to learn the generic visual tracking features. Each 'conv' layer is followed by a 'relu' layer, while only the first two 'conv' layers are followed by a 3 × 3 'pooling' layer with stride of 2. Numbers in square brackets are kernel size, number of outputs and stride. No padding is applied. The last layer 'fc1' is also implemented by a 'conv' layer.

3. The proposed Siamese CNN tracker

Inspired by the previously mentioned trackers, especially [12,14] and [26], we present a new Siamese CNN tracker that is trained with a margin contrastive loss. First of all, the Siamese CNN illustrated in Fig. 1 is trained with different training sequences, as presented in Section 3.1. A generic tracker is obtained after network offline training. The generic tracker is used to demonstrate the effectiveness of our proposed Siamese structure as illustrated in Fig. 1. The way the generic tracker works is presented in the first part of Section 3.2. In order to improve the tracking performance, domain specific fine-tuning as well as short- and long-term memory based online updating are adopted, which are presented in the second part of Section 3.2. In the last part of Section 3.2, we propose a new online tracking approach by constructing a Siamese region proposal network, which is applied by embedding a correlation layer at the end of the Siamese tracker. Candidate image patches as well as hard negative samples for online tracking are mined by the region proposal network.

3.1. Training of Siamese CNN tracker

3.1.1. Architecture of the training network

The proposed network is illustrated in Fig. 1. The architecture of each single branch of the Siamese network is exactly the same as the VGG-M network [13], which is also used for visual tracking in [12] and [29]. Note that the multi-scale ROI pooling strategy as used in [10] is not adopted. As suggested by Nam and Han [12], a smaller network is sufficient for visual tracking because visual tracking aims to distinguish only two classes, target and background. Further, the targets in the visual tracking task are typically small. However, Bertinetto [14] demonstrates that the deep ResNet performs better than AlexNet for their proposed SiameseFC tracker. Apparently, a deeper network is more discriminative but also more time consuming. To balance the trade-off between accuracy and efficiency, our network consists of 6 convolution layers with a final representation stride of 16. Note that no padding is applied to any convolutional or pooling layer, so that the network is translation invariant [14].

The network receives two 161 × 161 RGB image patches {img_i, img_j} as inputs. The output of the last layer 'fc1' has 512 units, termed f_i and f_j. A margin contrastive loss is taken as the final layer of the Siamese network,

L(f_i, f_j) = \frac{1}{2} y_{ij} D^2(f_i, f_j) + \frac{1}{2} (1 - y_{ij}) \max\left(0, \varepsilon - D^2(f_i, f_j)\right)

where y_{ij} ∈ {0, 1} stands for the ground truth label of the input pair {img_i, img_j}, indicating whether they show the same object or not. D(f_i, f_j) denotes the Euclidean distance between the two features. The margin ε is set to be 1.
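As a reading aid, here is a minimal NumPy sketch of this loss (forward pass only; the function name and array shapes are illustrative, and the actual training is performed with SGD in Caffe):

```python
import numpy as np

def contrastive_loss(f_i, f_j, y_ij, margin=1.0):
    """Margin contrastive loss of Section 3.1.1 (forward pass only).

    f_i, f_j : 512-d 'fc1' feature vectors from the two branches.
    y_ij     : 1 if the two input patches show the same object, else 0.
    """
    d2 = float(np.sum((f_i - f_j) ** 2))   # squared Euclidean distance D^2
    return 0.5 * y_ij * d2 + 0.5 * (1 - y_ij) * max(0.0, margin - d2)
```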

3.1.2. Dataset for training

Our first attempt is to train the Siamese network to obtain generic CNN features for visual tracking. Good features for tracking should be able to represent all kinds of variations such as target rotation, deformation, occlusion, etc. In our network architecture as shown in Fig. 1, the Euclidean distance between these feature representations should be independent of any particular target. To gain this end, the network should have seen as many training sequences as possible. Furthermore, the training dataset should be varied enough, covering a good amount of semantics and not focusing on any particular type of objects. We also need to make sure that none of the training data appear in the test dataset (OTB-100 is used as the test dataset in our experiments in Section 4).

Two different datasets are adopted for training. The first one is the VOT dataset, which is small, while the second one is the combination of the VOT, ALOV and ImageNet Video datasets, which is much larger. For the first training dataset, VOT-2013, VOT-2014 and VOT-2015 are combined together. After removing video sequences that appear in the OTB-100 dataset (used for testing), the final training dataset, termed VOT-58, contains 58 video sequences.


The same training dataset is applied to train the MDNet network [12], which has an excellent tracking performance when tested on OTB-100. The ImageNet Video dataset [16] contains more than 4000 video sequences of 30 categories, such as turtles, lizards, bicycles and so on. This dataset is applied to train a fully convolutional Siamese network in [14], which achieves good tracking performance at real-time speed. In order to make sure that each training sequence is suitable for visual tracking, we apply the dataset curation steps suggested in [14]. Furthermore, we manually remove sequences that contain too much occlusion, or too many similar targets in a same frame, to avoid confusion. Sequences that contain fewer than 80 frames are also removed. Finally, 1359 sequences are chosen out of the ImageNet Video dataset, termed VID-1359. There is considerable discrepancy between VOT-58 and VID-1359 in terms of target categories. In order to let the network see as many sequences as possible, the ALOV dataset is also introduced. ALOV-302 is collected from the ALOV dataset after removing sequences overlapping with OTB-100. As a result, the second training dataset {VID-1359, VOT-58, ALOV-302} contains 1719 different sequences, termed Combined-1719.

3.1.3. Training details

Instead of training the two-stream Siamese network from scratch, we load the network parameters pre-trained for the ImageNet classification task, and fine-tune the Siamese network with our training sequences. Since the filters of the lower layers are activated the most by low-level visual patterns such as edges and angles, whereas higher layers are activated the most by more complex patterns such as faces and wheels, we assume that the features of the lower layers are similar across different visual tasks. Thus, the fine-tuning learning rate of the lowest 3 convolution layers is set 10 times smaller than that of the other layers.

The training dataset contains a ground truth box annotation in each frame following a particular object. For each frame of a training sequence, training image patches are extracted randomly based on the box annotation. All image patches that have an intersection-over-union (IOU) overlap larger than 0.7 with the corresponding ground truth box are considered positive samples, while the ones with an IOU overlap smaller than 0.5 are taken as negative samples. 50 positive and 200 negative region boxes are selected by randomly shifting and scaling the ground truth box. Thus, for a training sequence with N frames, N × 50 positive samples and N × 200 negative samples are prepared for network training. All these samples are resized to 161 × 161 × 3 as network input. The batch size is set to 8, consisting of 3 positive image patch pairs and 5 negative image patch pairs. For each positive pair, 2 different positive samples are randomly chosen from a same training sequence. For each negative pair, 1 positive sample and 1 negative sample from the same sequence are chosen. We have also tried to select negative pairs from different sequences, but it turns out to make little difference. That is, for each training batch, one sequence is randomly chosen from all the training sequences. Then, 6 positive samples and 5 negative samples are selected from the chosen sequence to construct 3 positive and 5 negative training pairs. Note that the batch size for network training is quite small. Besides, samples in a batch are drawn from a same sequence, so they are not independent of each other. Thus, no batch normalization is applied.
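To make the batch composition concrete, a hedged Python sketch follows (the `sequence` dictionary and its 'positives'/'negatives' pools are hypothetical placeholders for the per-sequence samples described above, not the authors' actual data structures):

```python
import random

def build_batch(sequence):
    """Compose one training batch of 8 pairs: 3 positive + 5 negative.

    `sequence` is assumed to hold the pools drawn for one training
    sequence: patches with IOU > 0.7 ('positives') and IOU < 0.5
    ('negatives'), all resized to 161 x 161 x 3.
    """
    batch = []
    for _ in range(3):                        # positive pairs: same object
        p1, p2 = random.sample(sequence["positives"], 2)
        batch.append((p1, p2, 1))
    for _ in range(5):                        # negative pairs: object vs background
        pos = random.choice(sequence["positives"])
        neg = random.choice(sequence["negatives"])
        batch.append((pos, neg, 0))
    return batch
```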

For the VOT-58 dataset, training is performed over 5 epochs, each consisting of 1,000,000 sampled pairs. For the combined dataset, 5 epochs, each consisting of 60,000,000 sampled pairs, are performed. The initial learning rate is set to 10^-4 for the lowest three convolution layers, and 10^-3 for the others. The learning rate is decayed by a factor of 0.2 after each epoch.

We also tried to train the proposed Siamese network by adopting the training procedure of the Siamese-FC network, where the target patch combined with a large search image is taken as the two branch inputs and the logistic loss averaged over the final score map is taken as the loss value for back-propagation. The Siamese tracker trained this way shows similar performance to our proposed training approach.

3.2. Tracking by Siamese CNN tracker

3.2.1. The generic tracker

After the previous training procedures are implemented, a generic Siamese CNN tracker is obtained. Generic tracker means that no model updating is applied during tracking. The target bounding box in the first frame is the only clue for the following tracking procedure. Specifically for our proposed Siamese CNN tracker, the tracking network is constructed from a single stream of the Siamese training network after removing the margin contrastive loss layer. The output of the single stream tracking network is the 'fc1' feature with 512 units. After applying the RELU operation, the final 'fc1' feature for tracking is quite sparse.

The radius sampling strategy [7] is employed for target candidate generation of the generic tracker. Around the predicted location of the previous frame, we sample locations randomly on circles of different radii. Different from [7], we generate multiple candidate boxes at different scales in order to handle scale variations. For each frame, 256 samples are drawn in the translation and scale dimensions. The Euclidean distance between the 'fc1' feature of the ground truth image patch p_gt and the 'fc1' feature of each candidate sample p_i is measured. The candidate patch that has the smallest Euclidean distance from the ground truth patch p_gt is assumed to be the new location of the tracked object:

p^* = \arg\min_{p_i} \left( \left\| \mathrm{fc1}(p_i) - \mathrm{fc1}(p_{gt}) \right\| \right)

where the ground truth patch p_gt is the target patch in the first frame. It is our only prior information about the sequence to be tracked.
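As a sketch, this candidate selection step can be written as follows (a minimal NumPy illustration under the assumptions above; the names are ours, not from the paper's code):

```python
import numpy as np

def select_candidate(fc1_gt, fc1_candidates):
    """Pick the candidate whose 'fc1' feature is closest to the target's.

    fc1_gt         : 512-d feature of the first-frame target patch.
    fc1_candidates : (256, 512) array, one row per sampled candidate box.
    Returns the index of the winning candidate.
    """
    dists = np.linalg.norm(fc1_candidates - fc1_gt, axis=1)
    return int(np.argmin(dists))
```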

The test result on the OTB-100 dataset shows that our proposed generic network has similar performance to the Siamese networks proposed in [10,14], but there is still a significant gap compared with the MDNet proposed in [12]. This is caused by the fact that no model updating is performed during tracking. It is an ideal goal to gain a bag of generic CNN features that is representative for all kinds of variations in visual tracking. However, the performance of the generic CNN tracker is not as good as expected, even after having seen almost 2000 training videos. Remember that both MDNet and all the other excellent shallow feature based trackers apply domain specific training using the object information in the first frame. They also employ online model updates during tracking. Since our trained generic CNN features for tracking are not representative enough, we should also consider incorporating model updating and other tricks into our proposed CNN tracker.

3.2.2. Tracking with domain specific updating

Most shallow feature based discriminative trackers utilize an offline trained SVM or boosting classifier to perform a tracking-by-detection procedure and then update the classifier during online tracking. The classifier is initially trained on the first frame image, which is the only cue of the sequence to be tracked. The domain-specific layers of the CNN based MDNet are also pre-trained with positive and negative samples drawn from the first frame of the testing sequence. Instead of adding a domain specific layer to our network, we fine-tune the last 3 convolutional layers of our Siamese network using the first frame and the corresponding ground truth bounding box. Positive and negative samples are drawn from the first frame. More concretely, 500 positive samples as well as 2000 negative samples are selected randomly to fine-tune the last 3 convolution layers of the Siamese network, that is, the 'conv4', 'conv5' and 'fc1' layers.


Fig. 2. The proposed Siamese Region Proposal network to detect instance-specific candidate regions.

The domain specific fine-tuning procedure is the same as the network training procedure, except that the learning rates of the 'conv4' and 'conv5' layers are fixed to 0.0001, while that of 'fc1' is set to 10 times that of the 'conv4' and 'conv5' layers (0.001) and the others to 0. Besides, the batch size for domain specific fine-tuning is set to 64, and 30 batches are performed.

In order to further improve the tracker's performance, we adopt a short- and long-term online update procedure which is similar to MDNet. Specifically, for each frame on which the detected optimal target has a high confidence level, positive and negative examples are randomly drawn around the detected target position. Short-term memory means that online training samples collected from recent frames are used for model updating, while long-term memory updating collects online training samples from a longer time span. Long-term updates are performed at regular intervals (for our tracker, every 20 frames) using positive training samples collected from the recent 100 frames (a long-term period). Short-term updates are conducted whenever the confidence level of the estimated target is low, using the positive samples drawn from the recent 30 frames (a short-term period). As suggested by MDNet, in both short-term and long-term update cases, the negative samples are collected in the short term, since old negative examples are often redundant or irrelevant to the current frame. Specifically, for each frame on which the detected optimal target has a high confidence level, 50 positive examples as well as 100 negative examples are randomly drawn around the detected target position. Note that the confidence level of the predicted box is measured by the Euclidean distance between the 'fc1' features of the candidate image patch and the ground truth patch. The online training examples are collected only when the Euclidean distance is less than 0.3, meaning that the confidence level is high. The last three layers of the tracking network are updated every 20 frames (long-term update) or whenever the Euclidean distance is bigger than 0.5 (short-term update), meaning that the confidence level is low. The learning rate and batch size of the online update procedure are the same as those of the domain specific fine-tuning procedure, but only 10 batches are performed. The domain specific fine-tuning and the online updating are both termed domain specific updating.
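The update schedule just described can be summarized in a small sketch (our illustrative paraphrase of the rules, not the authors' code):

```python
def update_decision(frame_idx, fc1_distance):
    """Short-/long-term update rules of Section 3.2.2 (a sketch).

    fc1_distance : Euclidean distance between the chosen candidate's
                   'fc1' feature and the first-frame target's feature
                   (small distance = high confidence).
    """
    collect_samples = fc1_distance < 0.3    # confident: harvest new examples
    long_term = frame_idx % 20 == 0         # regular update every 20 frames
    short_term = fc1_distance > 0.5         # low confidence: update now
    return collect_samples, long_term, short_term
```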

Apparently, the tracker could perform better if it has seen the object and its background more thoroughly. With this end in view, we design a Siamese region proposal network that searches for hard negative examples for domain specific network fine-tuning and online updating. Those hard negative examples are combined with the randomly selected negative examples for model fine-tuning. Furthermore, the proposed region proposal network is also used for candidate target proposals.

3.2.3. Tracking with Siamese region proposal network

As presented in Section 2.2, an object tracker that is not limited to a local search window is presented in [26]. It has the ability to probe the entire frame efficiently. The method generates a small number of "high-quality" proposals by a novel instance-specific objectness measure and evaluates them against the object model. During the tracking process, the object model is updated concentrating on hard false-positives supplied by the proposals, which help suppress distractors caused by complex background clutters. The method is a shallow feature based approach, but it outperforms most recent state-of-the-art trackers.

Inspired by the EBT tracker, we propose to use a Siamese convolutional network for instance-specific region proposal. In order to take advantage of the fast implementation of the fully convolutional architecture, a dynamic convolution layer is appended after the 'conv5' layers of our proposed Siamese tracking network. The network architecture for region proposal is shown in Fig. 2. There are only two differences between the region proposal network shown in Fig. 2 and the tracking network illustrated in Fig. 1. For the region proposal network, the input image of the first stream, termed img_gt, has a fixed size of 161 × 161 × 3, while the input size of the second stream, termed img_sc, is arbitrary, depending on the size of the whole frame to be tracked. Actually, the input of the first stream is fixed as the ground truth target image patch resized to 161 × 161 × 3, so we have to make sure that the search image, which is the input of the second stream, is also resized by the same scale. The 'conv5' feature of img_gt is 3 × 3 × 512, while the 'conv5' feature of img_sc is k × k × 512 with k > 3, depending on the size of img_sc. For the dynamic convolution layer, the 3 × 3 × 512 sized 'conv5' feature is taken as the convolutional kernel weights, while the k × k × 512 sized 'conv5' feature is taken as the feature map to be convolved over.

Specifically, as shown in Fig. 2, with a search image sized 449 × 449 × 3, the size of the 'conv5' feature would be 21 × 21 × 512. After convolving with the 3 × 3 × 512 sized kernel, which is the 'conv5' feature of img_gt, the final score map sized 19 × 19 is obtained. Each pixel value on the resulting score map represents the similarity between the img_gt patch and the img_sc patch at the corresponding position. In [14], the score map is directly adopted as the guide to locate the new position of the searched instance. As shown in Fig. 2, the score map has its highest value at the center; correspondingly, the searched instance appears at the center of the search image. Although the tracker proposed in [14] is able to operate at frame-rates beyond real-time, its tracking performance is not satisfactory, even when the network is built on the really deep ResNet. Instead of making the final decision merely by the score map, we draw object-specific region proposals from the score map, and further decide whether these proposals are the final tracking result.
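The dynamic convolution amounts to cross-correlating the target's 'conv5' feature with the search image's 'conv5' feature. A minimal NumPy sketch follows (loop-based for clarity, no padding, stride 1; an efficient version would use an optimized convolution routine):

```python
import numpy as np

def score_map(feat_gt, feat_sc):
    """Dynamic convolution layer of Section 3.2.3 (a sketch).

    feat_gt : (3, 3, 512) 'conv5' feature of the 161x161 target patch,
              used as the convolution kernel.
    feat_sc : (k, k, 512) 'conv5' feature of the search image, k > 3.
    Returns a (k-2, k-2) similarity map, e.g. 19 x 19 when k = 21.
    """
    k = feat_sc.shape[0]
    out = np.empty((k - 2, k - 2))
    for y in range(k - 2):
        for x in range(k - 2):
            window = feat_sc[y:y + 3, x:x + 3, :]
            out[y, x] = np.sum(window * feat_gt)   # cross-correlation, no flip
    return out
```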

After adding a cosine window to the score map to penalize large displacements, a small number of "high-quality" proposals by an instance-specific objectness measure are generated at three different scales {0.964, 1, 1.037}, that is, 42 × 3 = 126 samples. These candidates are then further evaluated by the proposed CNN tracker to determine the new object position during online tracking.


Fig. 3. Success plots on OTB100 for the internal comparisons. 'dsft' stands for domain specific fine-tuning. 'dsft' is also applied to the 'radius_sampling_update' and 'region_proposal_update' curves.

Fig. 4. Success plots on OTB100 for the online update parameter setting comparisons.

Furthermore, the "high-quality" proposals drawn from the whole frame are also explored as hard negative examples for domain specific model fine-tuning and online updating. Hard negative examples explored across the whole image help the network check the background of the object more thoroughly, and thus improve the tracking network's ability to distinguish the tracked instance from clutter in the background. More concretely, high scored regions with an IOU overlap with the ground truth bounding box of less than 0.3 are taken as the hard negative examples for domain specific fine-tuning and online updating. In all our experiments, the top 25 positions on the score map are considered as "high-quality" proposals to be further tested for whether they are hard negative samples. For the domain specific fine-tuning procedure, the top 25 positions with all the positions shifted by half the stride as well as three scales {0.964, 1, 1.037} are considered, which gives 25 × 9 × 3 = 675 examples. Hard negative examples are drawn from these 675 patches. Note that only the ones among the 675 patches that have an IOU overlap with the detected object of less than 0.3 are collected as hard negative examples. Besides, the score map is not penalized by the cosine window for the hard negative sample drawing procedure. Considering that the background of a video sequence usually changes slowly, the hard negative examples are drawn from the region proposal network every 30 frames.
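The hard negative selection can be sketched as follows (a hedged illustration; the mapping from score-map positions to image boxes and the IOU helper are simplified placeholders):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def mine_hard_negatives(score_map, boxes, detected_box, top_k=25, iou_max=0.3):
    """Keep top-scoring proposals that barely overlap the detected target.

    score_map    : 2-D similarity map from the region proposal network
                   (not cosine-windowed for this procedure).
    boxes        : list mapping each flattened score-map index to a box.
    detected_box : current estimate of the target's bounding box.
    """
    order = np.argsort(score_map.ravel())[::-1][:top_k]   # highest scores first
    return [boxes[i] for i in order if iou(boxes[i], detected_box) < iou_max]
```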

Since the CNN features are shared between the region proposal network and the tracking network, the way of integrating the region proposal network and the tracking network is similar to that of the faster-RCNN target detection algorithm, which shares the CNN features between objectness region proposal and target detection. To sum up, the Siamese region proposal network helps to improve the tracking performance in two aspects. The first is that the network is able to search for candidate target positions over the whole frame, similar to the EBT tracker. The second is that hard negative examples mined by the Siamese region proposal network make the tracking network more discriminative compared with negative examples mined by a random sampling approach.

4. Experiments

The proposed tracker, termed SRPT (Siamese Region Proposal Tracker), is evaluated on the large benchmark dataset OTB-100 [31] containing 100 videos, with comparisons to state-of-the-art methods. Trackers are evaluated following the protocol in [31] using the success plot, which measures the percentage of successfully tracked frames. A frame is defined as successfully tracked if the IOU value between the predicted bounding box and the ground truth box is bigger than a threshold. The success plot is obtained by varying the IOU threshold in [0, 1]. A per distortion type comparison is also performed. The OTB-100 dataset contains 11 different types: background clutter, occlusion, fast motion, scale variation, deformation, illumination variation, motion blur, out of view, in-plane rotation, out-of-plane rotation, and low resolution. Trackers are ranked by the area under curve (AUC) score of the plots generated by the toolkit provided by the benchmark.
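For clarity, the success plot and its AUC score can be computed as in the following sketch (our illustration of the OTB protocol described above, not the benchmark toolkit itself):

```python
import numpy as np

def success_auc(ious, thresholds=np.linspace(0.0, 1.0, 101)):
    """Success curve and AUC score for one tracker.

    ious : per-frame IOU between predicted and ground-truth boxes
           over all evaluated frames.
    """
    ious = np.asarray(ious)
    success = np.array([(ious > t).mean() for t in thresholds])
    return success, float(success.mean())   # curve and its AUC ranking score
```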

4.1. Design evaluation

4.1.1. Basic network

The proposed Siamese tracker is trained on two different training datasets, VOT-58 and Combined-1719, respectively. The trained network is adopted as a generic tracker and used to perform arbitrary object tracking on the test dataset OTB-100. No domain specific fine-tuning or online updating is applied. The success plots as well as the AUC values are shown in Fig. 3, termed SRPT_VOT-58 and SRPT_Combined-1719 respectively. We can see that the network performs much better after being trained on the Combined-1719 dataset, which contains almost 30 times more sequences than the VOT-58 dataset. Actually, the network tends to be overfitted to the VOT dataset after training for more than 5 epochs, and thus loses its universality for arbitrary object tracking.

Although the proposed network shows acceptable performance after being trained on the Combined-1719 dataset, with an AUC-OTB100 value of 0.524, it is only comparable with the MEEM approach (0.530), and not as good as the SiameseFC (0.547) or SINT (0.583) methods, as shown in Fig. 4. Actually, the MEEM tracker is a shallow feature based approach with a complex domain specific initialization and online-updating strategy, while SiameseFC and SINT are deep feature based approaches designed as generic trackers without any online updating during tracking. The good performance of the deep feature based approaches relies on the fact that the features are more representative than shallow features and are trained specifically for the tracking task.


The SINT network has a similar architecture to our proposed network, except that SINT is constructed on the very deep VGGNet. Moreover, the SINT network adopts a multi-scale ROI pooling method aiming for precise localization, at the cost of severe computation overhead during tracking. The proposed Siamese tracker is much smaller and simpler compared with the SINT network. However, we cannot ignore the fact that the SiameseFC tracker only contains 5 convolutional layers, which is the same as our proposed network. We assume that the excellent performance of SiameseFC depends on its cross-correlation based final map, which has been proven many times to be quite suitable for the tracking task [14,23,32,33]. Remember that MDNet is able to achieve amazing performance by training on the small dataset VOT-58. It owes this to its excellent online updating and hard negative sample mining strategy.

4.1.2. Incorporating domain specific updating

For our updating approach, there are two steerable parameters to control the online learning procedure. One is the learning rate of online model updating; the other is the confidence level threshold used to decide whether the tracked object is reliable enough for training example drawing. In order to verify the validity of our parameter settings, we try different parameter values on the OTB-100 testing dataset to see the difference between the success plots' AUC scores. Specifically, we set the learning rate of the conv4-conv5 layers as [0.1, 1, 10] × 0.0001 separately with the confidence threshold set to 0.3, and then we set the confidence threshold as [0.2, 0.3, 0.4] with the learning rate of conv4-conv5 as 0.0001. The success plots are shown in Fig. 4. We can see that our parameter setting (lr = 0.0001, thres = 0.3) gives the best performance. It is worth noting that our tracker is not very sensitive to the learning rate and confidence level threshold. This somewhat guarantees the robustness of the online updating procedure. From the per distortion type comparison, we also find that a smaller learning rate is not good for illumination variation, deformation and in-plane rotation changes, and that a bigger confidence threshold is better for background clutter circumstances, which is quite reasonable.

Next, we compare the basic tracking network with three other particular design choices: (1) domain specific fine-tuning based on the first frame of each tracking sequence; (2) tracking and online updating by the radius sampling strategy; (3) adopting the region proposal network for drawing tracking candidates and samples for online updating.

Although domain specific fine-tuning is only performed on the first frame of the tracking sequences, notable improvement can be seen in Fig. 3 by testing on OTB-100. The AUC value of SRPT-dsft is 0.558, which is slightly higher than the SiameseFC tracker (0.547), but still not as good as the very deep SINT tracker (0.583). Owing to a dual-component approach consisting of short- and long-term memory stores to process target appearance memories, the shallow feature based MUSTer tracker also performs better (0.572) than the proposed SRPT-dsft.

The AUC value grows by a big margin when online model updating is employed, no matter whether by radius sampling or by network region proposal. Besides, there is an AUC margin of 0.033 between the two different online-updating approaches, which verifies the effect of our specially designed combination of the Siamese tracker and the Siamese region proposal network, which share architecture and weights and thus cost little extra time. Besides, for each frame of the tracking scene, there are only 126 candidates to be checked as the tracked object, which is fewer than in approaches that draw candidates by greedy local searching. Thus the tracking efficiency is improved. Without online updating, our tracker runs at 5 fps on a GeForce GTX 970M GPU. With the commonly used radius sampling online updating strategy, it runs at 1 fps, which is comparable with the MDNet tracker. With online updating and our proposed region proposal network, our tracker runs at 3 fps.

4.2. State-of-the-art comparison

We evaluate the proposed tracker with comparisons to 6 state-of-the-art trackers. MDNet represents the deep CNN tracker with the best performance. SINT and Siamese-FC are deep Siamese CNN trackers without any domain specific fine-tuning. DeepSRDCF [34] is the best correlation filter tracker combined with deep CNN features. MEEM and MUSTer are shallow feature based trackers with excellent performance.

Fig. 5. Success plots on OTB100 for the different trackers comparisons.

4.2.1. Overall comparison

Fig. 5 shows the success plots of the 6 trackers compared with the plot of our proposed SRPT tracker. All these 6 trackers were proposed in the recent two years and represent the best performing ones. MDNet had been the best tracker in terms of tracking accuracy for the whole of 2016 until the recently presented ECO [28] and SANet [30] trackers, which outperform MDNet by about 1.5% in terms of AUC-OTB100. We can see that the AUC-OTB100 value of our SRPT tracker is 0.9% smaller compared with MDNet. However, due to the region proposal strategy, the number of candidate targets per frame is much smaller compared with that of MDNet, which is set to 256. Also, due to the lower frequency of model updating, SRPT is three times faster than MDNet, yet gives comparable tracking performance. Due to the absence of model updating during online tracking, the deep feature based SINT and SiameseFC networks give smaller AUC values (0.583 and 0.547, respectively). However, SiameseFC is able to perform at real-time speed. MUSTer and MEEM are shallow feature based approaches with complex online updating strategies. They show comparable performance with SINT and SiameseFC.

4.2.2. Per distortion type comparison

The characteristics of tracking algorithms can be better understood from sequences with the same attributes. There are 11 types of distortions in the OTB-100 dataset. Per distortion type comparisons (except for in-plane rotation) between SRPT and the other 6 trackers are shown in Fig. 6. We can see that our SRPT tracker slightly outperforms MDNet for 6 out of 11 types of distortions, which further verifies the effectiveness of SRPT.


Fig. 6. The success plots for ten challenge attributes: illumination variation, out-of-plane rotation, scale variation, occlusion, motion blur, deformation, fast motion, background clutter, out of view, and low resolution.


Fig. 7. Qualitative results of the proposed method on some challenging sequences (from top to bottom: Bird1, Bolt2, ClifBar, Coupon).

Fig. 8. Failure cases of our method (Jump and Soccer). Green and red bounding boxes denote the ground-truths and our tracking results, respectively. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

It can be noticed that SRPT is better than MDNet at smaller overlap thresholds, indicating that the predicted bounding box of SRPT is not very tight. When the overlap threshold approaches 1, the success rate of DeepSRDCF is higher compared with all other trackers, for both the overall plot and the per distortion plots. It means that the DCF tracker framework is able to produce tighter tracking bounding boxes. The success plots for low resolution in the last figure of Fig. 6 show that SiameseFC performs much better than SRPT and MDNet. This is mainly caused by the fact that the SiameseFC network has an overall stride of 8, whereas the strides of SRPT and MDNet are both 16. Apparently, a big stride step is not helpful for tracking small targets.

4.2.3. Qualitative evaluation

A qualitative comparison of our approach with state-of-the-art trackers on the challenging Bird1, Bolt2, ClifBar and Coupon sequences is shown in Fig. 7.


It can be observed that our proposed SRPT tracker is able to deal with deformation, background clutter, rotation, etc. Fig. 8 shows two failure cases of the proposed SRPT tracker. It can be seen that the tracker gets lost when the tracked object moves fast and dramatically (Jump) or when too much occlusion occurs (Soccer).

5. Conclusion

We propose a visual tracker using a Siamese CNN combined with a Siamese region proposal network and domain specific updating. The region proposal network is able to identify potential targets from across the whole frame, and also to mine hard negative examples to make the network more discriminative for the specific sequence. It shares weights with the tracking network, and thus spends little extra time on region proposal. Domain specific fine-tuning and short- and long-term based online updating are also adopted. A thorough analysis of how each design component helps to improve the tracking performance is presented. For trackers based on a CNN, the size and variety of the training dataset are important. Domain specific fine-tuning using samples drawn from the first frame is able to improve the performance. The model updater can affect the result significantly. Experiment results on the OTB-100 dataset show that the proposed tracking algorithm performs favorably against existing state-of-the-art methods.

Reference

[1] Y. Guo, Y. Liu, A. Oerlemans, et al., Deep learning for visual understanding: a review, Neurocomputing 187 (2016) 27–48.
[2] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the CVPR, 2016.
[3] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: unified, real-time object detection, in: Proceedings of the CVPR, 2016.
[4] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, A.C. Berg, SSD: single shot MultiBox detector, in: Proceedings of the ECCV, 2016.
[5] W. Zhong, H. Lu, M.-H. Yang, Robust object tracking via sparse collaborative appearance model, IEEE Trans. Image Process. 23 (5) (2014) 2356–2368.
[6] B. Babenko, M.-H. Yang, S. Belongie, Visual tracking with online multiple instance learning, in: Proceedings of the CVPR, IEEE, 2009, pp. 983–990.
[7] S. Hare, A. Saffari, P.H.S. Torr, Struck: structured output tracking with kernels, in: Proceedings of the ICCV, IEEE, 2011, pp. 263–270.
[8] Z. Kalal, K. Mikolajczyk, J. Matas, Tracking-learning-detection, IEEE TPAMI 34 (7) (2012) 1409–1422.
[9] J.F. Henriques, R. Caseiro, P. Martins, J. Batista, High-speed tracking with kernelized correlation filters, IEEE TPAMI 37 (3) (2015) 583–596.
[10] R. Tao, E. Gavves, A.W.M. Smeulders, Siamese instance search for tracking, in: Proceedings of the CVPR, 2016.
[11] K. Chen, W. Tao, Once for all: a two-flow convolutional neural network for visual tracking, IEEE Trans. Circuits Syst. Video Technol. PP (99) (2016) 1.
[12] H. Nam, B. Han, Learning multi-domain convolutional neural networks for visual tracking, in: Proceedings of the CVPR, 2016.
[13] K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman, Return of the devil in the details: delving deep into convolutional nets, in: Proceedings of the BMVC, 2014.
[14] L. Bertinetto, J. Valmadre, J.F. Henriques, A. Vedaldi, P.H.S. Torr, Fully-convolutional Siamese networks for object tracking, in: Proceedings of the ECCV Workshops, 2016.
[15] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Cehovin, G. Fernandez, T. Vojir, et al., The visual object tracking VOT2015 challenge results, in: Proceedings of the ICCV Workshops, 2015, pp. 1–23.
[16] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg, L. Fei-Fei, ImageNet large scale visual recognition challenge, IJCV 115 (3) (2015) 211–252.
[17] A.W. Smeulders, D.M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, M. Shah, Visual tracking: an experimental survey, IEEE TPAMI 36 (7) (2014) 1442–1468.
[18] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: convolutional architecture for fast feature embedding, arXiv:1408.5093, 2014.
[19] L. Leal-Taixé, C. Canton-Ferrer, K. Schindler, Learning by tracking: Siamese CNN for robust target association, in: Proceedings of the CVPR Workshops, 2016, pp. 418–425.
[20] J. Zhang, S. Ma, S. Sclaroff, MEEM: robust tracking via multiple experts using entropy minimization, in: Proceedings of the ECCV, 2014.
[21] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, in: Proceedings of the NIPS, 2015.
[22] N.Y. Wang, J. Shi, D.Y. Yeung, J. Jia, Understanding and diagnosing visual tracking systems, in: Proceedings of the ICCV, 2015.
[23] Z. Hong, Z. Chen, C. Wang, X. Mei, D. Prokhorov, D. Tao, Multi-store tracker (MUSTer): a cognitive psychology inspired approach to object tracking, in: Proceedings of the CVPR, 2015.
[24] J. Wang, H. Zhu, S. Yu, C. Fan, Object tracking using color-feature guided network generalization and tailored feature fusion, Neurocomputing 238 (2017) 387–398.
[25] P. Zhang, T. Zhou, W. Huang, et al., Online object tracking based on CNN with spatial-temporal saliency guided sampling, Neurocomputing 257 (2017) 1–13.
[26] G. Zhu, F. Porikli, H. Li, Tracking randomly moving objects on edge box proposals, in: Proceedings of the CVPR, 2016.
[27] C.L. Zitnick, P. Dollár, Edge boxes: locating object proposals from edges, in: Proceedings of the ECCV, 2014.
[28] M. Danelljan, G. Bhat, F.S. Khan, M. Felsberg, ECO: efficient convolution operators for tracking, in: Proceedings of the CVPR, 2017.
[29] H. Nam, M. Baek, B. Han, Modeling and propagating CNNs in a tree structure for visual tracking, arXiv:1608.07242, 2016.
[30] H. Fan, H. Ling, SANet: structure-aware network for visual tracking, in: Proceedings of the CVPR Workshops, 2017, pp. 2217–2224.
[31] Y. Wu, J. Lim, M.-H. Yang, Object tracking benchmark, IEEE TPAMI, PrePrints.
[32] M. Danelljan, G. Häger, F.S. Khan, M. Felsberg, Learning spatially regularized correlation filters for visual tracking, in: Proceedings of the ICCV, 2015.
[33] C. Ma, J.B. Huang, X. Yang, M.H. Yang, Hierarchical convolutional features for visual tracking, in: Proceedings of the ICCV, 2015.
[34] M. Danelljan, G. Häger, F. Khan, M. Felsberg, Convolutional features for correlation filter based visual tracking, in: Proceedings of the ICCV Workshops, 2015.

Han Zhang received the B.S. degree in electronics science and technology from Shanghai Jiao Tong University, Shanghai, China, in 2010 and the M.S. degree from National University of Defense Technology, Changsha, China, in 2012. She is currently working as a Research Associate at the Northwest Institute of Nuclear Technology, Xi'an, China. Her research interests include multitemporal remote sensing, image analysis, and pattern recognition.

Weiping Ni was born in China in 1980. He received the B.S. degree from the University of Science and Technology of China, Hefei, China, in 2004, the M.S. degree from National University of Defense Technology, Changsha, China, in 2006, and the Ph.D. degree in pattern recognition and intelligent systems from Xidian University, Xi'an, China, in 2016. Since 2014, he has been a Research Associate with the Northwest Institute of Nuclear Technology, Xi'an, China. His research interests include remote sensing image processing, automatic target recognition, and computer vision.

Weidong Yan was born in 1967. He received the B.S. and M.S. degrees in electronic engineering from the School of Electrical Engineering, National University of Defense Technology, Changsha, China. He is currently a Research Fellow with the Northwest Institute of Nuclear Technology, Xi'an, China. His research interests include remote sensing image analysis and pattern recognition.

Junzheng Wu received the B.S. degree (2008) in automation from Tsinghua University, China, and the M.S. degree (2011) in signal and information processing from the Northwest Institute of Nuclear Technology (NINT). He is currently a Research Assistant at NINT, and his research interests include computer vision and remote sensing image processing.


Hui Bian was born in 1971. He is an Associate Research Fellow with the Northwest Institute of Nuclear Technology, Xi'an, China. His research interests include image fusion, target detection, and pattern recognition.

Deliang Xiang received the B.S. degree in remote sensing science and technology from Wuhan University, Wuhan, China, in 2010 and the M.S. degree from National University of Defense Technology, Changsha, China, in 2012. He is currently pursuing the Ph.D. degree in microwave remote sensing at KTH Royal Institute of Technology, Stockholm, Sweden. His research interests include urban area remote sensing, PolSAR image processing, and pattern recognition.