journal of la siamsnn: spike-based siamese network for ... · for example, 512 time steps are...

8
JOURNAL OF L A T E X CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1 SiamSNN: Spike-based Siamese Network for Energy-Efficient and Real-time Object Tracking Yihao Luo, Min Xu, Caihong Yuan, Xiang Cao, Yan Xu, Tianjiang Wang and Qi Feng Abstract—Although deep neural networks (DNNs) have achieved fantastic success in various scenarios, it’s difficult to employ DNNs on many systems with limited resources due to their high energy consumption. It’s well known that spiking neural networks (SNNs) are attracting more attention due to the capability of energy-efficient computing. Recently many works focus on converting DNNs into SNNs with little accuracy degradation in image classification on MNIST, CIFAR-10/100. However, few studies on shortening latency, and spike-based modules of more challenging tasks on complex datasets. In this paper, we focus on the similarity matching method of deep spike features and present a first spike-based Siamese network for object tracking called SiamSNN. Specifically, we propose a hybrid spiking similarity matching method with membrane potential and time step to evaluate the response map between exemplar and candidate images, with the same function as correlation layer in SiamFC. Then we present a coding scheme for utilizing temporal information of spike trains, and implement it in output spiking layers to improve the performance and shorten the latency. Our experiments show that SiamSNN achieves short latency and low precision loss of the original SiamFC on the tracking datasets OTB-2013, OTB-2015 and VOT2016. Moreover, SiamSNN achieves real-time (50 FPS) and extremely low energy consumption on TrueNorth. Index Terms—Spiking neural networks, object tracking, low energy consumption, short latency, real-time. I. I NTRODUCTION N OWADAYS Deep Neural Networks (DNNs) have shown remarkable performance in various scenarios [1], [2], [3]. However, due to the high computation cost and heavy energy consumption, it’s difficult to employ DNNs on embedded systems such as mobile devices. To this end, many tiny but efficient networks [4], [5] are proposed and achieve promising performance. However, the problem of insufficient computing power still exists in resource-constrained systems. As an alternative, spiking neural networks (SNNs) have realized ultra-low power consumption on neuromorphic hard- ware (such as TrueNorth [6] and Loihi [7]) by transmitting information in the form of an event-driven binary spike trains rather than continuous values like DNNs [8]. Furthermore, spiking neurons in SNNs only fire an output spike when their membrane potentials exceed a certain threshold [9], which This work was supported in part by the National Nature Science Foundation of China under Grant 61572214. (Corresponding author: Qi Feng.) Y. Luo, M. Xu, X. Cao, Y. Xu, T. Wang and Q. Feng are with School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China (email: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]). C. Yuan is with School of Computer and Information Engineering, Henan University, Kaifeng 475004, China (email: [email protected]). sparsely activates neurons in networks to save energy. SNNs are being regarded as the third generation artificial neural networks [10] because they are bio-inspired and more like a real human brain than the common DNNs. SNNs have shown the effectiveness in some scenarios, yet there is still a gap to DNNs owing to the hardness of training [11]. One major reason is that the training algorithms, such as STDP [12] and backpropagation [13], can only train shallow SNNs with two or three layers. In order to avoid this issue, an alternative way is converting trained DNNs to SNNs [14], [15], [16]. This aims to convert DNNs to SNNs by transferring the well-trained models with weight rescaling and normalization methods, and there would only be a small loss of accuracy compared to original DNNs. Recently [17] propose a hybrid coding scheme, which achieves near-lossless conversion in image classification. However, SNNs approaches that provide competitive results mostly in classification but few tasks beyond perception and inference on static images [18]. These methods are mainly suitable for convolutional neural networks (CNNs) with base layers like ReLU, pooling and batch normalization (BN). It’s necessary to design more spike-based modules for various models. The competitive results only demonstrate that the useful information for image classification have not been lost, and it’s not clear about the transmission of other in- formation. As far as we know, only Spiking-YOLO [19] successfully performs object detection by solving regression problem through channel-normalization and converting leaky- ReLU to a spiked fashion. In addition, SNNs of remarkable performance usually requires long latency, which causes the inability to reach real-time that limits applications of SNNs. Consequently, Focusing on shortening latency and exploiting various spike-based modules for more advanced tasks are significant to the popularization of SNNs. Object tracking, one of the most important problems in the field of computer vision with wide applications, is a more challenging task than image classification due to the difficulties of occlusions, deformation, background cluttering and other distractors [20]. An object bounding box is solely provided for the first frame of a sequence and the algorithm should estimate the target position in each remaining frame. Recently Siamese DNNs based trackers obtain state-of-the-art results [21], while SNNs only have few attempts in simple tracking scenarios [22], [23]. In this paper, we focus on the similarity matching method of deep spike trains and present a spike-based Siamese network for object tracking called SiamSNN based on SiamFC [3]. Specifically, we first implement a simple matching method arXiv:2003.07584v1 [cs.CV] 17 Mar 2020

Upload: others

Post on 27-May-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: JOURNAL OF LA SiamSNN: Spike-based Siamese Network for ... · For example, 512 time steps are needed to represent the positive integer 512. To represent 0.006, 6 spikes and 1000 time

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1

SiamSNN: Spike-based Siamese Network forEnergy-Efficient and Real-time Object Tracking

Yihao Luo, Min Xu, Caihong Yuan, Xiang Cao, Yan Xu, Tianjiang Wang and Qi Feng

Abstract—Although deep neural networks (DNNs) haveachieved fantastic success in various scenarios, it’s difficult toemploy DNNs on many systems with limited resources due totheir high energy consumption. It’s well known that spikingneural networks (SNNs) are attracting more attention due tothe capability of energy-efficient computing. Recently manyworks focus on converting DNNs into SNNs with little accuracydegradation in image classification on MNIST, CIFAR-10/100.However, few studies on shortening latency, and spike-basedmodules of more challenging tasks on complex datasets. In thispaper, we focus on the similarity matching method of deep spikefeatures and present a first spike-based Siamese network forobject tracking called SiamSNN. Specifically, we propose a hybridspiking similarity matching method with membrane potentialand time step to evaluate the response map between exemplarand candidate images, with the same function as correlationlayer in SiamFC. Then we present a coding scheme for utilizingtemporal information of spike trains, and implement it in outputspiking layers to improve the performance and shorten thelatency. Our experiments show that SiamSNN achieves shortlatency and low precision loss of the original SiamFC on thetracking datasets OTB-2013, OTB-2015 and VOT2016. Moreover,SiamSNN achieves real-time (50 FPS) and extremely low energyconsumption on TrueNorth.

Index Terms—Spiking neural networks, object tracking, lowenergy consumption, short latency, real-time.

I. INTRODUCTION

NOWADAYS Deep Neural Networks (DNNs) have shownremarkable performance in various scenarios [1], [2], [3].

However, due to the high computation cost and heavy energyconsumption, it’s difficult to employ DNNs on embeddedsystems such as mobile devices. To this end, many tiny butefficient networks [4], [5] are proposed and achieve promisingperformance. However, the problem of insufficient computingpower still exists in resource-constrained systems.

As an alternative, spiking neural networks (SNNs) haverealized ultra-low power consumption on neuromorphic hard-ware (such as TrueNorth [6] and Loihi [7]) by transmittinginformation in the form of an event-driven binary spike trainsrather than continuous values like DNNs [8]. Furthermore,spiking neurons in SNNs only fire an output spike when theirmembrane potentials exceed a certain threshold [9], which

This work was supported in part by the National Nature Science Foundationof China under Grant 61572214. (Corresponding author: Qi Feng.)

Y. Luo, M. Xu, X. Cao, Y. Xu, T. Wang and Q. Feng are with Schoolof Computer Science and Technology, Huazhong University of Science andTechnology, Wuhan, Hubei, 430074, China (email: [email protected];[email protected]; [email protected]; [email protected];[email protected]; [email protected]).

C. Yuan is with School of Computer and Information Engineering, HenanUniversity, Kaifeng 475004, China (email: [email protected]).

sparsely activates neurons in networks to save energy. SNNsare being regarded as the third generation artificial neuralnetworks [10] because they are bio-inspired and more likea real human brain than the common DNNs.

SNNs have shown the effectiveness in some scenarios,yet there is still a gap to DNNs owing to the hardness oftraining [11]. One major reason is that the training algorithms,such as STDP [12] and backpropagation [13], can only trainshallow SNNs with two or three layers. In order to avoidthis issue, an alternative way is converting trained DNNs toSNNs [14], [15], [16]. This aims to convert DNNs to SNNsby transferring the well-trained models with weight rescalingand normalization methods, and there would only be a smallloss of accuracy compared to original DNNs. Recently [17]propose a hybrid coding scheme, which achieves near-losslessconversion in image classification.

However, SNNs approaches that provide competitive resultsmostly in classification but few tasks beyond perception andinference on static images [18]. These methods are mainlysuitable for convolutional neural networks (CNNs) with baselayers like ReLU, pooling and batch normalization (BN). It’snecessary to design more spike-based modules for variousmodels. The competitive results only demonstrate that theuseful information for image classification have not beenlost, and it’s not clear about the transmission of other in-formation. As far as we know, only Spiking-YOLO [19]successfully performs object detection by solving regressionproblem through channel-normalization and converting leaky-ReLU to a spiked fashion. In addition, SNNs of remarkableperformance usually requires long latency, which causes theinability to reach real-time that limits applications of SNNs.Consequently, Focusing on shortening latency and exploitingvarious spike-based modules for more advanced tasks aresignificant to the popularization of SNNs.

Object tracking, one of the most important problems inthe field of computer vision with wide applications, is amore challenging task than image classification due to thedifficulties of occlusions, deformation, background clutteringand other distractors [20]. An object bounding box is solelyprovided for the first frame of a sequence and the algorithmshould estimate the target position in each remaining frame.Recently Siamese DNNs based trackers obtain state-of-the-artresults [21], while SNNs only have few attempts in simpletracking scenarios [22], [23].

In this paper, we focus on the similarity matching method ofdeep spike trains and present a spike-based Siamese networkfor object tracking called SiamSNN based on SiamFC [3].Specifically, we first implement a simple matching method

arX

iv:2

003.

0758

4v1

[cs

.CV

] 1

7 M

ar 2

020

Page 2: JOURNAL OF LA SiamSNN: Spike-based Siamese Network for ... · For example, 512 time steps are needed to represent the positive integer 512. To represent 0.006, 6 spikes and 1000 time

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 2

thorough membrane potential or time step. After our in-depthanalysis, we optimize them and propose a hybrid similaritymatching method for utilizing temporal information of spiketrains. Meanwhile, we present a coding scheme in outputspiking layer to optimize the temporal distribution in spiketrains for shortening the latency.

The proposed coding scheme are inspired by the neuralphenomenon of long-term depression (LTD) [24]. To the bestof our knowledge, SiamSNN is the first deep SNN for objecttracking that achieves short latency and low precision loss ofthe original SiamFC on the tracking datasets OTB-2013 [25],OTB-2015 [26] and VOT2016 [27]. The contributions of thispaper can be summarized as follows:

• We design SiamSNN for object tracking in deep SNNwith short latency and low precision loss. This is the firstsuccessful attempt to apply deep SNN to object trackingon the datasets with complex scenes.

• We propose a hybrid similarity matching method toconstruct spike-base correlation layer to evaluate theresponse map between exemplar and candidate images.The proposed method exploits temporal information tocut down accuracy degradation.

• We propose a novel coding scheme in the output spikinglayer to optimize the temporal distribution of spike trainsthat improves the performance and shortens the latency.

II. RELATED WORK

A. Deep SNNs through conversion

Many researches have been proposed for efficient SNNstraining [12], [13] and achieve certain effects. However, theyusually suffer accuracy degradation compared to the DNNs.One major reason is that the training algorithms only fitshallow SNNs with two or three layers but DNNs are muchdepper with great feature extraction and information process-ing capability. In order to construct deep SNNs, an alternativeway of directly training deep SNNs is converting trainedDNNs to SNNs with the same topology.

In the early stage work, Cao et al. [14] converts DNNsto SNNs and achieve excellent performance. They propose acomprehensive conversion scheme that can convert most mod-ules in CNN-based networks except bias, batch normalization(BN) and max-pooling layers. Diehl et al. [15] proposes data-based normalization method to enable a sufficient firing rateand achieve nearly lossless accuracy on MNIST [28] comparedto DNNs.

In the subsequent work, Rueckauer et al. [16] demonstratesthat merging BN layer to the preceding layer is losslessand introduce spike max-pooling. Kim et al. [19] proposeschannel-wise normalization to upgrade firing rate and firstapply deep SNN to object detection successfully. Senguptaet al. [29] introduces a novel spike-normalization method forgenerating an SNN with deeper architecture and experimenton complex standard vision dataset ImageNet [30]. Thesemethods have achieved comparable performance to DNNs inimage classification and object detection, while there are fewattempts to apply SNNs to matching problems such objecttracking.

B. Spike coding

Spike coding is the method of representing information withspike trains. Rate and temporal coding are the most frequentlyused coding schemes in SNNs.

Rate coding is based on spike firing rate, which is used asthe signal intensity, counted by the number of spikes occurredduring a period of time. Thus, it has been widely used inconverted SNNs [14], [16], [31] for classification by compar-ing the firing rates of each categories. However, rate codingrequires a higher latency for more accurate information [32].For example, 512 time steps are needed to represent thepositive integer 512. To represent 0.006, 6 spikes and 1000time steps are required. As a result, rate coding is inefficient interms of information transmission because of its redundancy,which leads to a long latency and energy consumption in deepSNNs.

Temporal coding uses timing information to mitigate longlatency. Time-to-first-spike [33] only utilizes the arrival timeof first spike during a certain time. And there are many othercoding schemes exploiting temporal information, such as rank-order coding [34] and phase coding [35]. On the basis of these ,Kim et al. [36] proposes weighted spikes by the phase functionto make the most of temporal information. Park et al. [17]introduces burst coding scheme inspired by neuroscience re-search, and investigates a hybrid coding scheme to exploitdifferent features of different layers in SNNs.

C. Object tracking

Object tracking aims to estimate the position of arbitrarytarget in a video sequence while its location only initializesin the first frame by a bounding box. Trackers usually learn amodel of the object’s appearance and match with the candidateimage in current frame to determine the location by maximumresponse point. Tracking scenarios often appear occlusions,out-of-view, deformation, background cluttering and othervariations, which makes object tracking more challenging [37].

Two main branches of recent trackers are based on corre-lation filter (CF) [38], [39] and DNNs [40], [41], [42]. CFtrackers train regressors in the Fourier domain and update theweights of filters to do online tracking. Trackers based onDNNs are trained end-to-end and off-line, which enhance therichness of model by big data sets. Motivated by CF, trackersbased on Siamese networks with similarity comparison strat-egy have drawn great attention due to their high performance.SiamFC [3] introduces the correlation layer rather than fullyconnected layers to evaluate the regional feature similaritybetween two frames, which highly improve the accuracy.

Many researches focus on improving the architec-ture of Siamese network and get state-of-the-art results.SiamRPN [43] adopts a region proposal network (RPN [44])after a Siamese network and enhance tracking perfor-mance, the object tracking task is decomposed into classi-fication and regression problem with correlation operation.SiamRPN++ [21] proposes a new model architecture to per-form layer-wise and depth-wise aggregations and successfullytrain a Siamese tracker with deeper CNNs. Liet al. [45]

Page 3: JOURNAL OF LA SiamSNN: Spike-based Siamese Network for ... · For example, 512 time steps are needed to represent the positive integer 512. To represent 0.006, 6 spikes and 1000 time

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 3

Fig. 1: Framework of SiamSNN. This framework consists of a Siamese spiking convolutional neural network backbone, two-status coding,and hybrid spiking similarity matching (temporal and potential correlation). Two-status coding optimizes the temporal distribution of outputspike trains. Hybrid spiking similarity matching calculates the response map which denotes the similarity score between the template andthe search region and the maximum of the response map indicates the target position.

investigates a method to enhance tracking robustness andaccuracy by learning target-aware features.

III. PROPOSED METHODS

A. Model Overview

SiamFC [3] first introduces the correlation layer to calculatethe response map and it’s simple but efficient. As our basemodel, SiamFC can be reconstructed by existing spikingmodules except for correlation layer, which aims to find themost similar instance to exemplar z from candidate frame xthrough an embedding function ϕ:

f(z, x) = ϕ(z) ∗ ϕ(x) + b · 1, (1)

where b · 1 denotes a bias equated in every location.The converted SiamFC is called SiamSNN, in which

the max-pooling and BN layers are implemented accordingto [16]. We adopt layer-wise normalization (abbreviated tolayer-norm) [15], [16] to prevent the neuron from over- orunder-activation. We detail the SiamSNN as shown in Fig. 1.

Different from DNNs, SNNs uses event-driven binary spiketrains rather than continuous values between neurons. Theintegrate-and-fire (IF) neuron model adopts membrane poten-tial Vmem to simulate the post-synaptic potential. The mem-brane potential in the lth layer for each neuron is describedas

V lmem,i(t) = V lmem,i(t− 1) + zli(t)− Vth(t)Θli(t), (2)

where Θli(t) is a spike in the ith neuron in lth layer, Vth(t) is

a certain threshold, and zli(t) is the input of the ith neuron inlth layer, which is described as

zli(t) =∑j

wljiVth(t)Θl−1j (t) + bli, (3)

where wl is the weight and bl is the bias in the lth layer.Each neuron will generate a spike only if their membrane

potential exceeds a certain threshold Vth(t). The process ofspike generation can be generalized as

Θli(t) = U

(V lmem,i(t− 1) + zli(t)− Vth(t)

), (4)

where U (·) is a unit step function. Once the spike is generated,the membrane potential is reset to the resting potential. Asshown in Eq. 2, we adopt the method of resetting by subtrac-tion rather than resetting to zero, which increases firing rateand reduces the loss of information. The firing rate is definedas

firing rate =N

T, (5)

where N is the total number of spikes during a given period T .The maximum firing rate will be 100% since neuron generatesa spike at every time step. Spike coding is the method ofrepresenting information with spike trains and different codingschemes cause various firing rates. A typical coding scheme isthat input layer with real coding (real value) and hidden layerswith rate coding [17].

B. Analysis

Like softmax layer, we can simply infer the responsemap output from the correlation computed on the membranepotentials of the last layer over the entire time. Hence, Eq. 1can be changed into a spiking form:

MV (z, x) =1

2T

(T∑t=1

ϕ(z(t))

)∗

(T∑t=1

ϕ(x(t))

), (6)

where z(t) and x(t) are the resulting spike trains encodedthrough input image for each pixel, T is the latency. In the restof the paper, we will refer to this simple method as potentialspiking matching (PSM).

We preliminary test PSM with layer-norm on OTB-2013,unfortunately, it suffers from severe performance degra-dation. To prevent under-activation from layer-norm, we

Page 4: JOURNAL OF LA SiamSNN: Spike-based Siamese Network for ... · For example, 512 time steps are needed to represent the positive integer 512. To represent 0.006, 6 spikes and 1000 time

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 4

(a) AUC on OTB-2013 (b) Average firing rates

Fig. 2: (a).The area under the curve (AUC) of success plots on OTB-2013 dataset. (b).Average firing rates of the three methods in conv1-4.

adopt channel-wise data-based normalization (abbreviated tochannel-norm) [19], which first introduces in object detectioninstead of layer-norm to gain more activations in short latency.Additionally, we implement burst coding scheme [17] in ourhidden layers to upgrade firing rate by generating a group ofshort-ISI spikes. Surprisingly, the result turns out to be loweras Fig. 2(a) shows. Although these two methods cause highfiring rates, they don’t improve accuracy.

There is no doubt that the main reason for a drop ofperformance is a reduction of firing rates in higher layers [31],so many researches aim at increasing the firing rate to enhancethe efficiency of the information transmission, although largernumber of spikes incurs more dynamic energy consumptionand significantly diminishes the merit of low energy con-sumption in deep SNNs [36]. High firing rate can ensurethe information transmission of small activations, which isbeneficial to image classification or detection, especially inthe deep layers for regression. But it can also bring morebackground information to obstruct similarity matching.

It’s indicated that only a few convolutional filters are activein describing the target in tracking and others contain irrelevantand redundancy information because inter-class differences aremainly related to a few feature channels [45]. For this reason,channel-wise normalization brings more irrelevant informationin tracking due to enhancing the activations of all channels.Moreover, burst coding scheme also strongly activates neuronsin all channels by generating a group of short inter-spikeinterval (ISI) spikes. In addition, small values of the short-ISIspikes result small response values in correlation operations,which will narrow the similarity gap between foreground andbackground.

Our in-depth analysis demonstrates latent reasons for theperformance degradation by increasing firing rates of allchannels. Therefore, we still utilize layer-norm to balance theweights and threshold voltage. However, as Fig. 2(b) shows,firing rate of most neurons in layer-norm is below 10%, whichresults in long latency in order to gain more activations. In thiscase, we expect for more similarity information to acceleratematching and improve accuracy. Temporal information ofspike trains are the main difference between DNNs and SNNs,which makes an important impact on information transmissionin SNNs [36], [18]. Motivated by this, we exploit temporalinformation in similarity matching method and coding scheme.

Fig. 3: Illustration of temporal correlation between spike train A andB.

C. Hybrid Spiking Similarity Matching

Similar to PSM, we can match the temporal similaritybetween spiking features of exemplar and candidate images(abbreviated to TSM). Essentially, correlation operation isalmost the same as convolution, it can be calculated asconvolution per time step in SNNs as follows:

MT (z, x) =1

2T

T∑t=1

(ϕ(z(t)) ∗ ϕ(x(t))) , (7)

However, it causes strict time consistency of correlation,which can also lead to a large drop of accuracy in actualmeasurement. For instance in Fig. 3, if spike train A of a pixelhas spikes at ti−1, and spike train B has spikes at ti, theirtemporal correlation is 0 during (ti−2, ti+1), but intuitivelythey are highly similar . The measures of spike train synchronyin [46] also show high similarity of the above two spike trains.Therefore, we introduce a response period with time weightsduring matching to solve this issue. It’s defined as:

M∆τ (z, x) =1

2T

T∑t=1

t+τ∑m=t−τ

(ϕ(z(m)) ∗ ϕ(x(t))

2|t−m|+ 1

), (8)

where τ is the response period threshold. As shown in Fig. 3,Eq. 8 makes spike trains A response with B during [tk −τ, tk+τ ]). Response value will attenuate gradually when timestep far from tk. There is a trade-off between computationconsumption and period threshold, and longer period has fewcontributions to the response value. Thus, τ is usually set toa small number like 2.

Considering PSM performs better than TSM on portions ofsequences in OTB-2013, we calculate the final response mapas follows:

fspike(z, x) = λvMV (z, x) + λtM∆τ (z, x) + b · 1, (9)

where λv and λt are the weights to do trade-off and nor-malization, b · 1 denotes the same bias in SiamFC. Notedthat in SiamFC, f(z, x) compares an exemplar image z toa candidate image x in current frame of the same size. Itgenerates high value if the two images describe the sameobject and low otherwise. And fspike(z, x) just simulates thefunction, not strictly calculates similarity between two spiketrains (the similarity equals 1 between two zero spike trains).It’s the reason for using 2T instead of spike numbers as thenormalization parameter to get response value.

Page 5: JOURNAL OF LA SiamSNN: Spike-based Siamese Network for ... · For example, 512 time steps are needed to represent the positive integer 512. To represent 0.006, 6 spikes and 1000 time

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 5

Fig. 4: Example of spiking value and temporal distribution in constantthreshold, weighted spike, burst coding and two-status coding.

It’s worth noting that almost all tracking models based onSiamese networks have the correlation layer, and many ofthem obtain state-of-the-art results [21], [45]. The proposedmethod makes it possible to convert these trackers to deepSNNs. Meanwhile, Siamese network has been used in variousfields, such as person re-identification and face recognition.Our approaches provide a reference for similarity matching ofspike trains in deep SNNs. In the rest of the paper, we willrefer to hybrid spiking similarity matching as HSM.

D. Two-Status Coding Scheme

Although Eq. 9 enhances the temporal correlation betweentwo spike trains, the temporal distributions of spikes arerandom and disorganized, which makes a large portion of spikevalues underutilized. We expect that the output spike of eachtime step will contribute to the final response value, whichshortens latency for reducing energy consumption.

Weighted spikes [36] are implemented by changing voltagethreshold Vth(t) depending on the phase of global referenceclock, which cyclically produces spikes with phase value. Thefunction of phase is given by

Vth(t) = 2−(1+ mod (t−1,K)) (10)

where K is the period of the phase. Weighted spikes needsonly K time steps to represent a K-bit data whereas rate codingwould require 2K time steps to represent the same data.

Similar as the exponential decay of threshold, Burst cod-ing [17] generates a group of short-ISI spikes as follows:

Vth,i(t) =

{γ · Vth,i(t) if Θl

i(t− 1) = 11.0 otherwise (11)

where γ is a burst constant. Threshold in burst coding changeswhile the neuron generates a spike, instead of changing peri-odically in weighted spikes. These two methods can transmitmore information with spikes than rate coding for the sametime steps, which results in shortened latency and higherenergy efficiency in classification. However, as analyzed be-fore, small values of spikes result in small response values

in correlation operations. To this end, we need periodic spikewith same value.

Motivated by the neural phenomenon that repetitive electri-cal activity can induce a persistent potentiation or depressionof synaptic efficacy in various parts of the nervous system(LTP and LTD) [24], we propose two-status coding scheme(hereinafter abbreviated as TS coding), which represents thevoltage threshold of potentiation status and depression status.We define the function as follows:

Vth(t) =

{α if mod (t/p, 2) = 1β otherwise , (12)

where α → +∞ to prevent generating spikes and β is oftensmaller than 1, p is a constant that controls the period ofneuron state change, t is the current time step.

We illustrate the spiking value and temporal distributionof spikes on different methods in Fig. 4. Constant thresholdgenerates spikes disorderly with stable values, weighted spikewith different values usually emerge in low threshold, burstcoding fires a group of short-ISI spikes with exponentiallydecreasing values. Our approach makes neurons only fire at thesame value in potentiation status and accumulate membranepotential in depression status. We constrain equivalent spikesin a fixed periodic distribution, increase the density of spikesin potentiation status to enhance response value. The proposedmethod can also save energy by avoiding neurons generatingspikes that are useless for matching. Moreover, spiking neu-rons in the next layer are not consuming during depressionstatus due to zero input.

Park et al. [17] proposes a hybrid coding scheme on inputand hidden layers by the motivation that neurons occasionallyuse different neural coding schemes depending on their rolesand locations in the brain. Inspired by this idea, we onlyuse TS coding scheme on the last (output) spiking layer,which is the input for correlation, because our method onlyoptimizes the temporal distribution for better matching withoutadvantages in transmission on hidden layers. And we use realvalue for the input layer due to its fast and accurate features.

IV. EXPERIMENTAL RESULTS

A. Experimental Settings

We implement SiamSNN and the weights and biases areconverted from SiamFC, which has simple architecture butex excellent performance. We evaluate our methods on OTB-2013, OTB-2015 and VOT2016. The simulation and imple-mentations are based on TensorFlow.

OTB and VOT are the widely used tracking benchmarks,which contain varieties of tracking challenging about out-of-view, variation, occlusions, deformation, fast motion, rotation,and background cluttering. OTB datasets adopt overlap scoreand center location error as their base metrics. The successplot shows the ratios of successful frames whose overlap ratiois larger than a given threshold. The precision plot measuresthe percentage of frames whose center location error in rangeof a certain threshold. The area under the curve (AUC) of thesuccess plot is usually used to rank tracking algorithms. Asfor VOT, the performance is evaluated in terms of accuracy

Page 6: JOURNAL OF LA SiamSNN: Spike-based Siamese Network for ... · For example, 512 time steps are needed to represent the positive integer 512. To represent 0.006, 6 spikes and 1000 time

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 6

TABLE I: The average overlap ratio across all the videos of various the challenges on OTB100, about the four configuration of SiamSNNand SiamFC. The challenges are abbreviated as follows: background clutters (BC), deformation (DEF), fast motion (FM), in-plane rotation(IPR), illumination variation (IV), low resolution (LR), motion blur (MB), occlusion (OCC), out-of-plane rotation (OPR), out-of-view (OV),scale variation (SV).

overlap(%) ALL BC DEF FM IPR IV LR MB OCC OPR OV SVSiamFC 67.2 65.4 62.2 67.4 66.4 65.6 69.3 69 65.5 65.5 66.6 66.4

SiamSNN+PSM 59.6 56.3 54.9 59.2 60.2 56.1 62 63.4 58.7 57.1 56.2 57.3SiamSNN+TSM 60.4 56.7 56.7 61.4 61.4 56.9 57.4 63.8 59.6 57.9 54.2 57.7SiamSNN+HSM 60.5 56.8 57.6 60.8 60.7 56.5 57.3 63.6 60.4 57.8 53.9 58

SiamSNN+HSM+TS 61.7 59.9 58.4 61.3 61.6 59.8 59.5 64 61.1 59.8 54.8 59.1

(a) Results on the OTB-2013 dataset

(b) Results on the OTB-2015 dataset

Fig. 5: Success and precision plots on the OTB-2013 and OTB-2015datasets.

(average overlap) and robustness (failure times). The overallperformance is evaluated through Expected Average Overlap(EAO) which takes account of both accuracy and robustness.

In the evaluations, we find that when the response periodthreshold τ is set to 1, the ratio of λv : λt is about 1:20, theneuron state period p is set to 10, and β is set between 0.6and 0.7 , the tracker is relatively effective. Moreover, whenα = 1000 is enough to prevent neurons generating spikes.

B. Tracking Results

To assess and verify the proposed methods, we use overlapratio (OR) as the basic metric firstly. OR is the Intersectionover Union (IoU) between tracked bounding box and groundtruth bounding box, which is calculated based on objectdetection challenge VOC [47]:

OR =area (Ot ∩Ogt)area (Ot ∪Ogt)

(13)

Because that no spike-based methods were used to convertcorrelation layer before, although PySpike [46] can measuredistance between two single spike trains, it’s hard to be imple-mented in deep SNNs. Thus, we only compare the four config-uration of SiamSNN with SiamFC, compute the average ORacross all the videos of various the challenges on OTB100, asshown in Table I. SiamSNN with HSM and TS coding achievesthe least accuracy degradation (5.5%) compared to SiamFC

(61.7% vs. 67.2%). For further evaluation, we measure theprecision and success plots of OPE on OTB-2013 and OTB-2015. As depicted in Fig. 5, the target AUC of precision andsuccess plots in SiamFC are 79%, 59.19% (OTB-2013) and77%, 57.78% (OTB-2015). And SiamSNN+HSM+TS achieves72%, 52.66% (OTB-2013) and 68%, 50.72% (OTB-2015).SiamSNN+HSM+TS achieves outstanding performance thanother algorithms.

We find that SiamSNN will reduce accuracy more or lessin each type of tracking challenge attributes, especially in lowresolution (LR) and out-of-view (OV). We choose some videosfor in-depth analysis. Fig. 6 depicts the tracking results of threetrackers (SiamFC, SiamSNN+PSM and SiamSNN+HSM+TS)in several challenging sequences. SiamSNN has almost thesame performance as SiamFC in most simple tracking sce-narios. However the converted spike-based model will haveweaker discrimination abilities, it is likely to make wrongprecisions in case of challenging scenarios.

In occlusion, motion blur and scale variation, the per-formance of SiamSNN+HSM+TS is closer to SiamFC thanSiamSNN+PSM. PSM often drift the target when heavy oc-clusions and scale variations occur in car1, david3, human7,jogging1.

In background clutters, plane rotation, fast motion, bothSiamSNN and SiamFC will drifts off the target when encoun-tering challenging circumstances. So that SiamSNN unexpect-edly performs well than SiamFC in the basketball sequence,and they all drift away in the bolt2 and skating2-1 sequence.

In low resolution, out-of-view, although SiamSNN canscarcely track the target in the sequences of lemming, surferand tiger1, the bounding boxes always deviate from groundtruths. .

On OTB-2015, overlap ratio will drop sharply onceSiamSNN drifts off the target, since the tracker will not bereset on OTB benchmark. Thus, we compare SiamFC andSiamSNN+HSM+TS on VOT2016. As shown in Table II,the accuracy only drops 3% but robustness becomes poor,and the degradation of EAO is 2%. From here we see thatthe converted spike-based model lose some discriminativefeature, which causes SiamSNN to drift off the target whenencountering complex circumstances.

C. Energy Consumption and Speed Evaluation

The latency requirements are presented in Fig. 7, SiamSNNconverges rapidly in 10 time steps and then rises slowly,long latency contributes little to the accuracy. To reach themaximum AUC, SiamSNN+HSM+TS requires approximately

Page 7: JOURNAL OF LA SiamSNN: Spike-based Siamese Network for ... · For example, 512 time steps are needed to represent the positive integer 512. To represent 0.006, 6 spikes and 1000 time

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 7

Fig. 6: Tracking results on 10 sequences of OTB-2015 (from left to right and top to down are basketball, bolt2, car1, david3, human7,jogging1, lemming, skating2-1, surfer, tiger1). The indexes are shown in the top left of each frame.

TABLE II: Comparison results about SiamFC and SiamSNN onVOT2016.

A R EAO

SiamFC 0.53 0.49 0.23SiamSNN 0.50 0.63 0.21

Fig. 7: Results of SiamSNN on different configurations evaluated bylatency and the AUC of success plots on OTB-2015.

20 time steps. In other tasks, burst coding demands 3000 timesteps for image classification on CIFAR-100 [17] and Spiking-YOLO needs 3500 time steps for object detection [19].HSM+TS achieves a remarkable performance of latency,which provides great possibilities to achieve real-time onneuromorphic chips.

We estimate energy consumption and speed from neuromor-phic chips (TrueNorth [6]) to investigate the effect of our meth-ods, and compare them to SiamFC. Bertinetto et al. [3] runSiamFC(3s) on a single NVIDIA GeForce GTX Titan X (250

TABLE III: Comparison of speed and energy consumption of SiamFCand SiamSNN

Methods FLOPs/SOPS Power (W) Time (ms) Energy (J)

SiamFC 5.44E+09 250 11.63 2.9

SiamSNN+ HSM+ TS 1.10E+09 2.75E-03 20 5.5E-05

Watts) and reach 86 frames-per-second (11.63ms per frame).TrueNorth measured computation by synaptic operations persecond (SOPS) and can deliver 400 billion SOPS per Watt,while floating-point operations per second (FLOPS) in modernsupercomputers. And the time step is nominally 1ms, set bya global 1kHz clock [6]. We count the operations with theformula in [16].

The calculation results of processing one frame are pre-sented in Table III. The energy consumptions of SiamSNNon TrueNorth are extremely lower than SiamFC on GPU.It’s worth noting that our proposed methods reach 50 FPSon tracking, while SNNs requires long latency in imageclassification and object detection. We can expect that ourSiamSNN can be implemented on many embedded systemsand applied widely in CV scenarios.

V. CONCLUSION

In this paper, we design SiamSNN, the first deep SNNmodel for object tracking with short latency and low pre-cision loss compared to SiamFC on OTB-2013, OTB-2015and VOT2016 datasets. Our analysis indicates that enhancingfiring rates of all neurons has no contribution for tracking.Consequently, we propose hybrid spiking similarity matching

Page 8: JOURNAL OF LA SiamSNN: Spike-based Siamese Network for ... · For example, 512 time steps are needed to represent the positive integer 512. To represent 0.006, 6 spikes and 1000 time

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 8

method and two-status coding scheme for taking full advantageof temporal information in spike trains, which achieves real-time on TrueNorth. We believe that our methods can be appliedin more Siamese networks for tracking, person re-identificationand face recognition.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classificationwith deep convolutional neural networks,” in Advances in neural infor-mation processing systems, 2012, pp. 1097–1105.

[2] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only lookonce: Unified, real-time object detection,” in Proceedings of the IEEEconference on computer vision and pattern recognition, 2016, pp. 779–788.

[3] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr,“Fully-convolutional siamese networks for object tracking,” in ECCV.Springer, 2016, pp. 850–865.

[4] X. Yu, T. Liu, X. Wang, and D. Tao, “On compressing deep models bylow rank and sparse decomposition,” in CVPR, 2017, pp. 7370–7379.

[5] Y. Wei, X. Pan, H. Qin, W. Ouyang, and J. Yan, “Quantization mimic:Towards very tiny cnn for object detection,” in ECCV, 2018, pp. 267–283.

[6] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada,F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura et al., “Amillion spiking-neuron integrated circuit with a scalable communicationnetwork and interface,” Science, vol. 345, no. 6197, pp. 668–673, 2014.

[7] M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S. H. Choday,G. Dimou, P. Joshi, N. Imam, S. Jain et al., “Loihi: A neuromorphicmanycore processor with on-chip learning,” IEEE Micro, vol. 38, no. 1,pp. 82–99, 2018.

[8] N. K. Kasabov, “Neucube: A spiking neural network architecture formapping, learning and understanding of spatio-temporal brain data,”Neural Networks, vol. 52, pp. 62–76, 2014.

[9] S. Ghosh-Dastidar and H. Adeli, “Spiking neural networks,” Interna-tional journal of neural systems, vol. 19, no. 04, pp. 295–308, 2009.

[10] W. Maass, “Networks of spiking neurons: the third generation of neuralnetwork models,” Neural networks, vol. 10, no. 9, pp. 1659–1671, 1997.

[11] A. Tavanaei, M. Ghodrati, S. R. Kheradpisheh, T. Masquelier, andA. Maida, “Deep learning in spiking neural networks,” Neural Networks,2018.

[12] N. Caporale and Y. Dan, “Spike timing–dependent plasticity: a hebbianlearning rule,” Annu. Rev. Neurosci., vol. 31, pp. 25–46, 2008.

[13] Y. Jin, W. Zhang, and P. Li, “Hybrid macro/micro level backpropagationfor training deep spiking neural networks,” in Advances in NeuralInformation Processing Systems, 2018, pp. 7005–7015.

[14] Y. Cao, Y. Chen, and D. Khosla, “Spiking deep convolutional neuralnetworks for energy-efficient object recognition,” International Journalof Computer Vision, vol. 113, no. 1, pp. 54–66, 2015.

[15] P. U. Diehl, D. Neil, J. Binas, M. Cook, S.-C. Liu, and M. Pfeiffer,“Fast-classifying, high-accuracy spiking deep networks through weightand threshold balancing,” in IJCNN. IEEE, 2015, pp. 1–8.

[16] B. Rueckauer, I.-A. Lungu, Y. Hu, M. Pfeiffer, and S.-C. Liu, “Con-version of continuous-valued deep networks to efficient event-drivennetworks for image classification,” Frontiers in neuroscience, vol. 11,p. 682, 2017.

[17] S. Park, S. Kim, H. Choe, and S. Yoon, “Fast and efficient informationtransmission with burst spikes in deep spiking neural networks,” inDesign Automation Conference. ACM, 2019, p. 53.

[18] K. Roy, A. Jaiswal, and P. Panda, “Towards spike-based machineintelligence with neuromorphic computing,” Nature, vol. 575, no. 7784,pp. 607–617, 2019.

[19] S. Kim, S. Park, B. Na, and S. Yoon, “Spiking-yolo: Spiking neural net-work for real-time object detection,” arXiv preprint arXiv:1903.06530,2019.

[20] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu, “Distractor-awaresiamese networks for visual object tracking,” in Proceedings of theEuropean Conference on Computer Vision (ECCV), 2018, pp. 101–117.

[21] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan, “Siamrpn++:Evolution of siamese visual tracking with very deep networks,” in CVPR,2019, pp. 4282–4291.

[22] G. Pedretti, V. Milo, S. Ambrogio, R. Carboni, S. Bianchi, A. Calderoni,N. Ramaswamy, A. Spinelli, and D. Ielmini, “Memristive neural networkfor on-line learning and tracking with brain-inspired spike timingdependent plasticity,” Scientific reports, vol. 7, no. 1, p. 5288, 2017.

[23] Z. Cao, L. Cheng, C. Zhou, N. Gu, X. Wang, and M. Tan, “Spikingneural network-based target tracking control for autonomous mobilerobots,” Neural Computing and Applications, vol. 26, no. 8, pp. 1839–1847, 2015.

[24] G.-q. Bi and M.-m. Poo, “Synaptic modifications in cultured hip-pocampal neurons: dependence on spike timing, synaptic strength, andpostsynaptic cell type,” Journal of neuroscience, vol. 18, no. 24, pp.10 464–10 472, 1998.

[25] Y. Wu, J. Lim, and M.-H. Yang, “Online object tracking: A benchmark,”in CVPR, 2013, pp. 2411–2418.

[26] Y. Wu, J. Lim, and M.-H.Yang, “Object tracking benchmark,” IEEETransactions on Pattern Analysis and Machine Intelligence, vol. 37,no. 9, pp. 1834–1848, 2015.

[27] J. M. M. F. R. P. P. M. Kristan, A. Leonardis and et al, “The visualobject tracking VOT2016 challenge results,” in ECCV 2016 Workshops,2016, pp. 777–823.

[28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learningapplied to document recognition,” Proceedings of the IEEE, vol. 86,no. 11, pp. 2278–2324, 1998.

[29] A. Sengupta, Y. Ye, R. Wang, C. Liu, and K. Roy, “Going deeper inspiking neural networks: Vgg and residual architectures,” Frontiers inneuroscience, vol. 13, 2019.

[30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet largescale visual recognition challenge,” International journal of computervision, vol. 115, no. 3, pp. 211–252, 2015.

[31] B. Rueckauer, I.-A. Lungu, Y. Hu, and M. Pfeiffer, “Theory and toolsfor the conversion of analog to spiking convolutional neural networks,”arXiv preprint arXiv:1612.04052, 2016.

[32] J. Gautrais and S. Thorpe, “Rate coding versus temporal order coding:a theoretical approach,” Biosystems, vol. 48, no. 1-3, pp. 57–65, 1998.

[33] S. Thorpe, A. Delorme, and R. Van Rullen, “Spike-based strategies forrapid processing,” Neural networks, vol. 14, no. 6-7, pp. 715–725, 2001.

[34] S. J. Thorpe, “Spike arrival times: A highly efficient coding schemefor neural networks,” Parallel processing in neural systems, pp. 91–94,1990.

[35] C. Kayser, M. A. Montemurro, N. K. Logothetis, and S. Panzeri, “Spike-phase coding boosts and stabilizes information carried by spatial andtemporal spike patterns,” Neuron, vol. 61, no. 4, pp. 597–608, 2009.

[36] J. Kim, H. Kim, S. Huh, J. Lee, and K. Choi, “Deep neural networkswith weighted spikes,” Neurocomputing, vol. 311, pp. 373–386, 2018.

[37] M. Fiaz, A. Mahmood, S. Javed, and S. K. Jung, “Handcrafted and deeptrackers: A review of recent object tracking approaches,” ACM Surveys,2018.

[38] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. Torr,“Staple: Complementary learners for real-time tracking,” in Proceedingsof the IEEE conference on computer vision and pattern recognition,2016, pp. 1401–1409.

[39] W. Ruan, J. Chen, Y. Wu, J. Wang, C. Liang, R. Hu, and J. Jiang, “Multi-correlation filters with triangle-structure constraints for object tracking,”IEEE Transactions on Multimedia, vol. 21, no. 5, pp. 1122–1134, 2018.

[40] D. Held, S. Thrun, and S. Savarese, “Learning to track at 100 fps withdeep regression networks,” in ECCV. Springer, 2016, pp. 749–765.

[41] Q. Wang, C. Yuan, J. Wang, and W. Zeng, “Learning attentionalrecurrent neural network for visual tracking,” IEEE Transactions onMultimedia, vol. 21, no. 4, pp. 930–942, 2018.

[42] H. Hu, B. Ma, J. Shen, H. Sun, L. Shao, and F. Porikli, “Robust objecttracking using manifold regularized convolutional neural networks,”IEEE Transactions on Multimedia, vol. 21, no. 2, pp. 510–521, 2018.

[43] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu, “High performance visualtracking with siamese region proposal network,” in CVPR, 2018, pp.8971–8980.

[44] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-timeobject detection with region proposal networks,” in Advances in neuralinformation processing systems, 2015, pp. 91–99.

[45] X. Li, C. Ma, B. Wu, Z. He, and M.-H. Yang, “Target-aware deeptracking,” in CVPR, 2019, pp. 1369–1378.

[46] M. Mulansky and T. Kreuz, “Pyspike-a python library for analyzingspike train synchrony,” SoftwareX, vol. 5, pp. 183–189, 2016.

[47] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisser-man, “The pascal visual object classes (voc) challenge,” Internationaljournal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.