arXiv:1612.00089v1 [cs.CV] 1 Dec 2016

Beyond standard benchmarks: Parameterizing performance evaluation in visual object tracking

Luka Čehovin, Alan Lukežič, Aleš Leonardis, Matej Kristan
University of Ljubljana

Faculty of computer and information science
{luka.cehovin, alan.lukezic, ales.leonardis, matej.kristan}@fri.uni-lj.si

Abstract

Benchmarks have played an important role in advancing the field of visual object tracking. Due to weakly defined and sometimes biased attribute specification, existing benchmarks do not allow fine-grained tracker analysis with respect to specific attributes. Apart from illumination changes and occlusions, the tracking performance is most strongly affected by the object motion. In this paper, we propose a novel approach for tracker evaluation with respect to the motion-related attributes. Our approach utilizes 360 degree videos to generate realistic annotated short-term tracking scenarios with exact specification of the object motion type and extent. A fully annotated dataset of 360 degree videos was constructed and fine-grained performance of 17 state-of-the-art trackers is reported. The proposed approach offers unique tracking insights, is complementary to existing benchmarks, and will be made publicly available. The evaluation system was implemented within a state-of-the-art performance evaluation toolkit and supports straightforward extension with third-party trackers.

1. Introduction

Single-target visual object tracking has made significant progress in the last decade. To a large extent this can be attributed to the adoption of benchmarks, through which common evaluation protocols, datasets and baseline algorithms have been established. Starting with the PETS initiative [28] in 2005, several benchmarks on general single-target short-term tracking have been developed since, most notably OTB50 [25], VOT2013 [15], ALOV300+ [21], VOT2014 [16, 14], OTB100 [26], TC128 [17] and VOT2016 [12].

A recent work on test-data validation in computer vision [29] argued that dataset variation, systematic organization and low redundancy are crucial for practical evaluation. The existing benchmarks address this by increasing the number of sequences [21], applying advanced dataset construction methodologies [12] and by annotating entire sequences or even individual frames with visual attributes [26, 14].

Figure 1: Generating view frames from a 360 degree source sequence. Manipulation of the camera parameters enables accurate parametrization of motion types in the generated viewpoint sequences.

Annotation of frames (sometimes entire sequences) with multiple attribute labels prohibits accurate attribute-wise analysis due to the attribute cross-talk. Moreover, the attribute labels are typically subject to human bias and are merely binary annotations without accurate parametrization. Such parametrization is possible through computer-graphics-generated sequences [9, 18]; however, the level of realism in object motion and appearance in such sequences still presents a limitation for performance evaluation of general tracking methods.



The recent benchmarks [12, 18] report that, apart from the obvious situations like full occlusions, the trackers' performance is largely affected by the object motion with respect to the camera. In fact, in many applications, such as drone tracking, performance varies mainly with respect to scale changes and fast motion [18]. Detailed analyses of these important motion attributes in existing benchmarks remain limited due to the limited size of datasets, lack of accurate attribute annotation, and lack of full camera control during sequence acquisition.

This paper addresses the aforementioned limitations of the existing benchmarks by proposing a framework for parametrization of motion-related attributes. We consider omnidirectional 360 degree videos as prerecorded proxy representations of the 3D world that capture the photo-realism of standard benchmarks. The framework offers a high degree of virtual camera control, which has so far been possible only in computer graphics generated sequences. Our approach focuses on detailed analysis of motion attributes and, at this point, we do not evaluate performance under occlusions and illumination changes. In this respect our approach complements the existing benchmarks in this important class of sequence attributes.

Contributions. The first contribution is a new motion-type performance evaluation paradigm and a new method to generate realistic sequences with a high degree of motion parametrization. We introduce a virtual camera model that utilizes 360 degree videos to generate realistic annotated short-term tracking scenarios (Figure 1). The exact specification of parameterized motion types in the sequences guarantees a clear causal relation between the generated motions and the tracking performance change. This allows for a fine-grained performance analysis that goes far beyond the capabilities of existing benchmarks. As part of the contribution, a parameterized sequence generator is developed using a popular performance evaluation toolkit [12]. The generator is highly extendable and compatible with the existing evaluation protocols.

The second major contribution is a new motion benchmark for short-term single-target trackers, MoBe2016. We have constructed a dataset of annotated 360 degree videos adding up to 17537 frames, defined twelve motion types, and evaluated 17 state-of-the-art trackers from recent benchmarks [26, 12]. The new benchmark, the results and the corresponding software will be made publicly available and are expected to significantly affect future developments in single-target tracking.

The paper is organized as follows: Section 2 contains a review of related work, Section 3 describes our sequence parameterization approach, Section 4 presents the proposed benchmark, Section 5 presents the results, and Section 6 contains discussion and concluding remarks.

2. Related work

Modern short-term tracking benchmarks [26, 15, 14, 13, 12, 21, 18] acknowledge the importance of motion-related attributes and support evaluation with respect to these. However, the evaluation capability significantly depends on the attribute presence and distribution, which is often related to the sequence acquisition. In application-oriented benchmarks like [18] the attribute distribution is necessarily skewed by the application domain. Some general benchmarks [26, 21] thus include a large number of sequences from various domains. But since the sequences are post-hoc annotated, the dataset diversity is not guaranteed. A recent benchmark [12] addressed this by considering the attributes already at the sequence collection stage and applied an elaborate methodology for automatic dataset construction.

The strength of per-attribute evaluation depends on the annotation approach. In most benchmarks [26, 21, 18] all frames are annotated by an attribute even if it occupies only a part of the sequence. Kristan et al. [14] argued that this biases per-attribute performance towards average performance and proposed a per-frame annotation to reduce the bias. However, a single frame might still contain several attributes, resulting in the attribute cross-talk bias.

The use of computer graphics in training and evaluation has recently been popularized in computer vision. Mueller et al. [18] propose online virtual-world creation for drone tracking evaluation, but use only a single object type, without motion parametrization and with a low level of realism. Gaidon et al. [9] address the realism levels of virtual worlds and ambient parametrization for learning and performance evaluation, however only for vehicle detection.

2.1. Comparison with our new motion benchmark

A comparison of our proposed MoBe2016 with related benchmarks is summarized in Table 1. The values under MAC indicate the percentage of frames in the dataset with at least a single motion attribute. The motion attributes are most frequent in MoBe (100% coverage) and UAV123 [18] (96% coverage). To reflect the dataset size in motion evaluation, we compute the number of effective frames per attribute (FPA). This measure counts the number of frames that contain a particular motion attribute, where each frame contributes with a weight inversely proportional to the number of motion attributes it contains. The FPA is highest for the application-specific UAV123 [18] (27107). Among the general tracking benchmarks, this value is highest for the proposed MoBe2016 (17537), which exceeds the second largest (OTB100 [26]) by over 30%.

The FPA alone does not fully reflect the evaluation strength since it does not account for the attribute cross-talk. A lower bound on the cross-talk is reflected by the INTER measure that shows the percentage of motion-annotated frames with at least two motion attributes. The measure shows that well over half of the frames in UAV123 [18] (78%) and OTB100 [26] (63%) suffer from the attribute cross-talk. The cross-talk is lowest for the proposed MoBe2016 (0%), ALOV [21] (0%) and VOT2016 [12] (32%).
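To make the two dataset statistics concrete, the following minimal Python sketch computes FPA and INTER from hypothetical per-frame attribute annotations; the list-of-sets input format is an assumption for illustration, not the benchmarks' actual annotation format.

```python
from collections import defaultdict

def fpa_and_inter(frame_attributes):
    """frame_attributes: one set of motion-attribute labels per frame.

    FPA: each frame contributes 1/len(attributes) to every attribute it contains.
    INTER: share of motion-annotated frames carrying two or more attributes.
    """
    fpa = defaultdict(float)
    annotated = with_crosstalk = 0
    for attrs in frame_attributes:
        if not attrs:
            continue  # frame without any motion attribute
        annotated += 1
        if len(attrs) >= 2:
            with_crosstalk += 1
        for a in attrs:
            fpa[a] += 1.0 / len(attrs)
    inter = 100.0 * with_crosstalk / annotated if annotated else 0.0
    return dict(fpa), inter

# Toy example: three frames, the second one suffers from attribute cross-talk.
print(fpa_and_inter([{"fast_motion"}, {"fast_motion", "scale_change"}, set()]))
```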

On average, most existing benchmarks are annotated by four motion types. The proposed MoBe2016 contains approximately three times more motion types than existing benchmarks. The existing benchmarks lack motion type quantification (e.g., the extent of speed in the attribute fast motion), which results in inconsistent definitions across benchmarks. In contrast, the motion types are objectively defined through their parametrization in the proposed MoBe2016.

Table 1: Comparison of MoBe2016 with popular recent tracking benchmarks: ALOV300+ [21], OTB100 [26], UAV123 [18] and VOT2016 [12]. Best, second best and third best values are shown in red, blue and green, respectively.

Dataset           [21]    [26]    [18]    [12]   MoBe
MAC (%)             19      88      96      61    100
FPA               4275   12929   27107    4366  17537
INTER (%)            0      63      78      32      0
Motion classes       3       3       3       3      6
Motion types         4       4       4       3     12
Parameterized       no      no      no      no    yes
Per-frame           no      no      no     yes    yes

3. Sequence parametrization

Two key concepts are introduced by our motion evaluation methodology: a source sequence and a viewpoint sequence. A source sequence is an omnidirectional video that simultaneously captures a 360 degree field of view. The video is stored as a projection onto a spectator-centered sphere, i.e., $S = \{S_t\}_{t=1:N}$, where $S_t$ is the projection at frame $t$. Such a representation allows generating arbitrary views of a 3D scene from the point of observation.

A viewpoint sequence is a sequence of images obtained from the spherical representation by projection into a pinhole camera, i.e., $I = \{I_t\}_{t=1:N}$. The camera model has adjustable rotation and focal length parameters, thereby defining the state of the camera at time $t$ as

$C_t = [\alpha_t, \beta_t, \gamma_t, f_t]$,    (1)

where the first three parameters are the Euler angles and $f_t$ denotes the focal length. Each frame in a viewpoint sequence is therefore the result of the corresponding image in the source sequence and the camera parameters, i.e., $I_t = p_{\mathrm{cam}}(S_t; C_t)$.
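As an illustration of the mapping $I_t = p_{\mathrm{cam}}(S_t; C_t)$, the sketch below samples a pinhole view from an equirectangular 360 degree frame with numpy. Note that the benchmark stores the videos as cube maps; the equirectangular input, the particular Euler angle convention and the nearest-neighbour sampling are simplifying assumptions made here for brevity.

```python
import numpy as np

def euler_to_R(alpha, beta, gamma):
    """Rotation matrix from Euler angles (Z-Y-X composition is an arbitrary choice here)."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    Rz = np.array([[ca, -sa, 0], [sa, ca, 0], [0, 0, 1]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rx = np.array([[1, 0, 0], [0, cg, -sg], [0, sg, cg]])
    return Rz @ Ry @ Rx

def render_view(pano, C, out_w=640, out_h=360):
    """Sample a pinhole view I_t = p_cam(S_t; C_t) from an equirectangular frame."""
    alpha, beta, gamma, f = C
    H, W = pano.shape[:2]
    u, v = np.meshgrid(np.arange(out_w) - out_w / 2.0,
                       np.arange(out_h) - out_h / 2.0)
    rays = np.stack([u, v, np.full_like(u, f)], axis=-1)      # camera-frame viewing rays
    rays = rays / np.linalg.norm(rays, axis=-1, keepdims=True)
    rays = rays @ euler_to_R(alpha, beta, gamma).T             # rotate rays into the sphere frame
    lon = np.arctan2(rays[..., 0], rays[..., 2])               # azimuth in [-pi, pi]
    lat = np.arcsin(np.clip(rays[..., 1], -1.0, 1.0))          # elevation in [-pi/2, pi/2]
    x = ((lon / (2 * np.pi) + 0.5) * (W - 1)).astype(int)
    y = ((lat / np.pi + 0.5) * (H - 1)).astype(int)
    return pano[y, x]                                          # nearest-neighbour lookup

pano = (np.random.rand(512, 1024, 3) * 255).astype(np.uint8)   # stand-in source frame S_t
view = render_view(pano, C=(0.3, 0.1, 0.0, 500.0))
print(view.shape)
```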

The ground truth object state in each frame is specified in a viewpoint-agnostic spherical coordinate system, i.e., $A = \{A_t\}_{t=1:N}$. Following the VOT Challenge protocol [14], the state is defined as a rectangle using four points, $A_t = \{\theta_t^i, \rho_t^i\}_{i=1:4}$. Given the pinhole camera viewpoint parameters $C_t$, the ground truth $A_t$ is projected into the image plane by projective geometry, i.e., $G_t = p_{\mathrm{gt}}(A_t; C_t)$.

The camera parameters $C_t$ are continually adjusted during the creation of the viewpoint sequence to keep the projected object within the field of view, thus satisfying the short-term tracking constraint. The camera viewpoint is adjusted via a camera controller $p_{\mathrm{con}}(\cdot, \cdot)$ that applies a prescribed motion type $E$ and maps the object ground truth state into camera parameters while satisfying the short-term tracking constraint, i.e.,

$p_{\mathrm{con}}(A_t, E, t) \mapsto C_t$,    (2)

Depending on the motion type specification, the controller generates various apparent object motions.
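A controller of the form of Eq. (2) can be sketched as a function that maps the spherical ground truth and a motion-type specification to camera parameters. The interface below is a hypothetical simplification: real motion types would also derive the focal length from the object's angular extent to hit a target diagonal in pixels, which is omitted here.

```python
import numpy as np

def p_con(A_t, E, t):
    """Map the ground-truth state A_t and a motion-type specification E to the
    camera parameters C_t = [alpha, beta, gamma, f] of Eq. (2).

    A_t: four annotated corner points (theta_i, rho_i) in spherical coordinates.
    E:   dict of offset schedules, each a function of the frame index t
         (a simplified stand-in for the benchmark's actual motion types).
    """
    thetas = np.array([p[0] for p in A_t])
    rhos = np.array([p[1] for p in A_t])
    # Keep the object near the image centre, then add the motion-type offsets.
    alpha = thetas.mean() + E["pan"](t)      # azimuth
    beta = rhos.mean() + E["tilt"](t)        # elevation
    gamma = E["roll"](t)                     # in-plane rotation
    f = E["focal"](t)                        # focal length -> apparent object size
    return np.array([alpha, beta, gamma, f])

# Example: a stabilized setup with a slow in-plane rotation.
E_rot = {"pan": lambda t: 0.0, "tilt": lambda t: 0.0,
         "roll": lambda t: 0.02 * t, "focal": lambda t: 500.0}
A_t = [(0.10, 0.02), (0.14, 0.02), (0.14, -0.03), (0.10, -0.03)]
print(p_con(A_t, E_rot, t=10))
```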

3.1. Motion type evaluation framework

The evaluation framework implements the VOT supervised evaluation mode [6] and the VOT [14, 12] performance evaluation protocol, which allows full use of long sequences. In this evaluation mode, a tracker is initialized and reset upon drifting off the target. Stochastic trackers are run multiple times and the results are averaged.

Figure 2: Overview of the motion type evaluation framework. The tracker is initialized with an image from the viewpoint sequence and an initialization region; it then sequentially receives new images and returns the predicted object region, which is inspected to determine whether the tracker has failed.

The following functionality is required by the supervised experiment mode: (1) reproducible sequence generation and (2) bi-directional tracker-evaluator communication. The viewpoint sequence generator is applied to generate one frame at a time from the 360 degree source sequence, as well as the ground truth annotations and the camera controller motion type parameters. The sequence and the 2D ground truth are therefore generated on the fly during the evaluation and are reproducible for each time-step. The communication between the evaluator and the tracker is implemented through the state-of-the-art TraX [22] communication protocol. Our motion type evaluation framework is summarized in Figure 2.
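The supervised mode can be summarized by a short evaluation loop. This is only a schematic sketch: generate_frame, init_tracker, update_tracker and overlap are hypothetical stand-ins for the on-the-fly sequence generator and the TraX-based tracker communication, and the failure handling follows the usual VOT convention of re-initializing a few frames after a zero-overlap failure.

```python
def run_supervised(n_frames, generate_frame, init_tracker, update_tracker,
                   overlap, skip_after_failure=5):
    """Schematic supervised run: the tracker is re-initialized after each failure."""
    results, t = [], 0
    while t < n_frames:
        frame, gt = generate_frame(t)            # viewpoint frame + projected ground truth
        tracker = init_tracker(frame, gt)        # (re-)initialize on the ground truth
        t += 1
        while t < n_frames:
            frame, gt = generate_frame(t)
            prediction = update_tracker(tracker, frame)
            score = overlap(prediction, gt)
            if score <= 0.0:                     # zero overlap is treated as a failure
                results.append(("failure", t))
                t += skip_after_failure          # skip a few frames before re-initializing
                break
            results.append(("overlap", t, score))
            t += 1
    return results

# Toy run with dummy stand-ins, just to show the control flow.
log = run_supervised(
    n_frames=20,
    generate_frame=lambda t: (None, (0, 0, 10, 10)),
    init_tracker=lambda frame, gt: {"region": gt},
    update_tracker=lambda tracker, frame: tracker["region"],
    overlap=lambda a, b: 1.0 if a == b else 0.0)
print(len(log))
```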

4. Motion benchmark—MoBe2016

Our motion parametrization framework is demonstrated on a novel single-target visual object tracking motion benchmark (MoBe2016). The benchmark contains a library of 17 state-of-the-art trackers, twelve motion types and a dataset of 360 degree source sequences. The dataset, the motion types, the trackers tested and the performance measures are explained next.

4.1. Dataset acquisition

The new dataset contains fifteen 360 degree videos with an average video length of 1169 frames, amounting to 17537 frames. The videos were mostly selected from a large collection of 360 degree videos available on YouTube. To maximize the target diversity, we recorded additional sequences using a Ricoh Theta 360 degree camera. Videos were converted to a cube-map projection and encoded with the MP4 H.264 codec. Each frame of the video was manually annotated by a rectangular region encoded in spherical coordinates using an annotation tool specifically designed for this use case. Some viewpoint frame examples of individual source sequences are shown in Figure 3.

Figure 3: A preview of sequences in the dataset from the view that centers the target.

4.2. Motion types specification

We consider six motion classes that reflect typical dynamic relations between an object (target) and a camera:

Stabilized setup, denoted as $E_b$, keeps the object at the image center and adjusts the camera distance to keep the object diagonal constant at 70 pixels. A variant with a diagonal constant at 35 pixels, $E_b^s$, is considered as well to test tracking objects from far away.

Centered rotation setup, denoted as $E_r$, fixes the object center and the scale as in $E_b$ and then rotates the camera around the optical axis. Two variants, with low and high rotation speeds, $E_r^s$ and $E_r^f$, respectively, are considered.

Displaced rotation setup, denoted as $E_d$, displaces the object center and then rotates the camera around its optical axis.

Scale change setup, denoted as $E_s$, fixes the center and then periodically changes the scale by a cosine function with amplitude oscillation around the nominal scale of $E_b$. Two variants, i.e., with a low, $E_s^s$, and a high frequency, $E_s^f$, but equally moderate amplitude are considered. Another variant with a moderate frequency but large amplitude, $E_s^w$, is considered as well.

Planar motion setup, denoted as $E_m$, displaces the camera from the object center and performs circular motion in the image plane. Variants with low, $E_m^s$, and high, $E_m^f$, frequency are considered.

Translation noise setup, denoted as $E_n$, fixes the center and the scale as in $E_b$ and then randomly displaces the center by drawing a displacement vector from a normal distribution. Two variants, one with small, $E_n^s$, and one with large, $E_n^w$, noise are considered.

The variations of the six motion classes result in 12 different motion types, which are illustrated in Figure 4; for the exact values of the motion parameters please see the Supplementary material. Note that each 360 degree video in our dataset creates a sequence with specific motion parameters. Thus each motion type is evaluated on all frames with only a single (quantitatively defined) motion type per frame present.
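For illustration, the scale-change and translation-noise setups reduce to simple parameter schedules like the ones below. The amplitudes, periods and noise levels shown are placeholders; the values actually used are listed in the Supplementary material.

```python
import numpy as np

def scale_change_focal(t, f_nominal=500.0, amplitude=0.3, period=200):
    """E_s: oscillate the focal length with a cosine around the nominal value,
    which periodically changes the apparent object scale."""
    return f_nominal * (1.0 + amplitude * np.cos(2.0 * np.pi * t / period))

def translation_noise_offset(rng, sigma=0.02):
    """E_n: random displacement of the camera centre drawn from a normal
    distribution (here in radians of azimuth and elevation)."""
    return rng.normal(0.0, sigma, size=2)

rng = np.random.default_rng(0)
print(scale_change_focal(t=50), translation_noise_offset(rng))
```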

4.3. Trackers tested

A set of 17 trackers was constructed by considering baseline and top-performing representatives on recent benchmarks [26, 12] from the following six broad classes of trackers. (1) Baselines include the standard discriminative and generative trackers MIL [2], CT [31], IVT [20], and FragTrack [1], a state-of-the-art mean-shift tracker ASMS [24], and the Struck SVM tracker [10]. (2) Correlation filters include the standard KCF [11] and three top-performing correlation filters on VOT2016 [12] – DSST [7], Staple [4] and SRDCF [8]. (3) Sparse trackers include the top-performing sparse trackers L1APG [3] and ASLA [27]. (4) Part-based trackers include the recent state-of-the-art CMT [19], LGT [32] and FoT [23]. In addition, the set comprises a state-of-the-art (5) hybrid tracker, MEEM [30], and a (6) ConvNet tracker, SiamFC [5].

Figure 4: Illustrative representations of motion types in MoBe2016.

4.4. Performance measures

The tracking performance is measured by the VOT [13] measures: tracker accuracy (A), robustness (R), as well as the expected average overlap (EAO). The accuracy measures the overlap between the output of the tracker and the ground truth bounding box during periods of successful tracking, while the robustness measures the number of times a tracker failed and required re-initialization [6]. The expected average overlap score is an estimator of the average overlap on a typical short-term sequence that a tracker would obtain without reset [13]. All scores are calculated on a per-sequence basis and averaged with weights proportional to the sequence length.
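Under these definitions, the per-sequence accuracy and robustness and the length-weighted averaging can be sketched as follows; the list-of-overlaps input format is an assumption for illustration, and the EAO estimator itself follows [13] and is omitted here.

```python
import numpy as np

def sequence_accuracy_robustness(overlaps):
    """overlaps: per-frame overlap with the ground truth, None on a failure frame.

    Accuracy   = mean overlap over frames with successful tracking.
    Robustness = number of failures (re-initializations) in the sequence."""
    valid = [o for o in overlaps if o is not None]
    accuracy = float(np.mean(valid)) if valid else 0.0
    robustness = sum(1 for o in overlaps if o is None)
    return accuracy, robustness

def weighted_average(per_sequence_scores, lengths):
    """Average per-sequence scores with weights proportional to the sequence length."""
    w = np.asarray(lengths, dtype=float)
    return float(np.sum(np.asarray(per_sequence_scores) * w) / np.sum(w))

# Toy example: two sequences, the second one contains a single failure.
seqs = [[0.7, 0.8, 0.6], [0.5, None, 0.4, 0.6]]
acc, rob = zip(*(sequence_accuracy_robustness(s) for s in seqs))
print(weighted_average(acc, [len(s) for s in seqs]),
      weighted_average(rob, [len(s) for s in seqs]))
```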

5. Results

The 17 trackers were evaluated on a total of 210444 frames containing various motion types, which makes this the largest fine-grained motion-related tracker evaluation to date. Results are summarized by A-R plots (Figure 5), general performance graphs (Figure 6, Figure 7) and in Table 2.¹ In the following we discuss performance with respect to various motion classes and types.

¹ Plots that contain more details are also available in higher resolution in the supplementary material, together with videos that depict tracking performance on the dataset.

Figure 5: A-R plots for each experiment in the benchmark. The vertical axis denotes accuracy and the horizontal axis denotes robustness. The sensitivity visualization parameter was set to 100 frames in all plots. Trackers shown: ASLA, ASMS, CMT, CT, DSST, FoT, FragTrack, IVT, KCF, L1APG, LGT, MEEM, MIL, SRDCF, SiamFC, Staple, Struck.

Figure 6: EAO values for all motion types over tested trackers.

Scale adaptation: Slow scale changes ($E_s^s$, Figure 6) are addressed best by correlation filters that apply scale adaptation (i.e., KCF, DSST, Staple, SRDCF). Their performance is not significantly affected as long as the change is gradual enough, even for large amplitudes ($E_s^w$). However, fast changes ($E_s^f$) significantly reduce performance, implying that the number of scales explored should be increased in these trackers. The ConvNet tracker SiamFC does not suffer from this discrepancy, which is likely due to the large set of scales it explores. The difference in performance drop for fast ($E_s^f$) and large ($E_s^w$) scale change is low for the scale-adaptive mean shift ASMS and the part-based trackers (i.e., CMT, FoT and LGT). In contrast to correlation filters, these trackers do not greedily explore the scale space but apply blob size estimation (ASMS) or key-point-like matching approaches (CMT, FoT, LGT). Average performance at moderate scale change is better for correlation filters than part-based trackers. Struck and MEEM are least affected by scale change among the trackers that do not adapt their scale. From the A-R plots in Figure 5 it is apparent that the performance drops are due to a drop in accuracy, but not in failures.

Rotation: Rotation ($E_r$) significantly affects performance of all tracking classes. Figure 7 and the A-R plots in Figure 5 show that the drop comes from a reduced accuracy as well as an increased number of failures across most trackers. The drop is least apparent with ASMS, FoT and LGT, which is likely due to their object visual models. The visual model in ASMS is rotation invariant since it is based on color histograms, while FoT and LGT explicitly address rotation by geometric matching. Rotation most significantly affects performance of correlation filters and ConvNets (Figure 6). These trackers apply templates for tracking, and since rotation results in significant discrepancies between the template and the object, the trackers fail. In particular, from the A-R plots in Figure 5 we see that slow rotation ($E_r^s$) only results in decreased accuracy, but fast rotation ($E_r^f$) results in increased failures as well (e.g., SRDCF). On the other hand, the performance of correlation filters, the ConvNet tracker (SiamFC) and the hybrid tracker (MEEM) surpasses the part-based models when no rotation is observed ($E_b$ in Figures 5 and 6).

Motion: From the A-R plots in Figure 5 we see that slow planar motion ($E_m^s$) only slightly reduces performance in general, but this reduction is significant for most trackers in the case of fast motion ($E_m^f$). LGT is the only tracker resilient to fast motion. A likely reason is the use of a nearly-constant-velocity motion model in the LGT. However, the performance of this tracker drops significantly when extensive random motion is observed ($E_n^w$ in Figure 6). Trackers like SiamFC and MEEM are least affected by all types of fast motions. The reason is likely their very large search region for target localization. The A-R plots in Figure 5 indicate that SiamFC fails much more often at fast motions ($E_m^f$) than MEEM, implying that MEEM is more robust at local search.

Object size: All trackers perform very well in the baseline setup ($E_b$), in which the object is kept centered and of constant size (Figure 7 and Figure 6). In fact, top performance is achieved by the correlation filter trackers. The reason is that the visual model assumptions that these trackers make exactly fit this scenario. When considering smaller objects ($E_b^s$), the following trackers appear unaffected: ASMS, KCF, SRDCF, L1APG, CMT, FoT, LGT and MEEM. This implies that the level of detail of the target representation in these trackers is unaffected by the reduced object size. Note that these trackers come from different classes. The A-R plots in Figure 5 show that the performance drop in tracking small objects is most significant for baselines like CT, IVT, MIL and FragTrack, as well as the sparse tracker ASLA and the Struck tracker. The performance drop comes from increased failures, which means that their representation is not discriminative enough at this scale, which leads to frequent drifts.

Figure 7: Motion type difficulty according to robustness, accuracy, and EAO over all evaluated trackers. Motion types are grouped by their corresponding motion classes: stabilized ($E_b$), centered rotation ($E_r$), displaced rotation ($E_d$), scale change ($E_s$), planar motion ($E_m$) and noise ($E_n$).

General observations: All trackers exhibit a large performance variance across the motion types (Figure 6). The variance appears lowest over most motion types for the part-based trackers, although their average performance is moderate. Table 2 shows the average performance over the motion types without the baseline motion type ($E_b$). Among the trackers whose average EAO is within 70% of the best EAO are three out of four correlation filters, a hybrid tracker (MEEM), a ConvNet tracker (SiamFC), two out of three part-based trackers and two baselines (ASMS and Struck). The top three trackers in average performance are Staple (0.470 EAO), MEEM (0.468 EAO) and SiamFC (0.448 EAO). These trackers also perform well on the recent benchmarks; however, our analysis shows that the weak spots of these trackers are target rotations, as well as fast movements and shaky videos.

Motion class difficulty: Considering the average EAO in Figure 7, the most difficult classes are rotation (both types, centered and displaced) as well as planar motion and translation noise, but the distribution of difficulty within individual classes as well as the degradation modes vary. The A-R plots in Figure 5 show that performance drops in rotation are due to inaccurate bounding box estimation, leading to reduced accuracy but not to complete failure. This figure also shows that trackers generally address planar motion well, but tend to fail at fast nonlinear motions due to large inter-frame displacement.

5.1. Relation to existing benchmarks

The relation of MoBe2016 to the existing tracking benchmarks was established by comparing the ranks of common trackers on the well-known OTB100 [26] and the recent UAV123 [18] benchmarks.

Comparison with OTB100: The OTB100 contains relatively old trackers, therefore the intersection is in the following seven trackers: ASLA, CT, FoT, IVT, L1-APG, MIL, and Struck. Figure 8 shows the ranking differences between these trackers for the different ranking modes. The ranking by average performance differs mainly for the L1-APG and FoT trackers. The possible reasons for this are different implementations, algorithm parameters, as well as a different evaluation methodology.² Three motion types in OTB100 are compatible with MoBe2016: scale variation, fast motion and in-plane rotation. Performance over the three scale-changing motion types on MoBe was averaged to obtain a scale change ranking. While FoT achieves top performance on MoBe2016, it is positioned relatively low on OTB100, which is likely due to different implementations (ours is from the authors) and the interaction of other attributes on OTB100. Both rankings place Struck at the top; the ranks of other trackers vary. The fast motion ranking on MoBe2016 was obtained by averaging fast motions, i.e., $E_m^f$ and $E_n^w$. Both benchmarks rank Struck as top performing and IVT as worst performing. The in-plane rotation attribute was compared with the combined ranking of centered and displaced rotation ($E_r$ and $E_d$). The situation is similar to scale change, where FoT, which explicitly addresses rotation, is ranked much lower according to OTB100.

² The OTB100 methodology does not restart the tracker on failure, which can lead to large differences between trackers that support re-detection and those that do not.

Figure 8: Tracker ranking comparison of MoBe2016 with OTB100 (left) and UAV123 (right). The trackers are sorted from left (best) to right. Attribute abbreviations: SV – Scale Variation, FM – Fast Motion, IPT – In-Plane Rotation, CM – Camera Motion.

Comparison with UAV123: MoBe2016 and UAV123 intersect in the following seven trackers: ASLA, DSST, IVT, KCF, MEEM, SRDCF, and Struck. The comparison of average performance, as well as with respect to the three motion types scale variation, camera motion and fast motion, is shown in Figure 8. The ranks are mostly consistent; the best two trackers are mostly SRDCF and MEEM. A discrepancy is observed for the MEEM tracker at the scale change attribute. MEEM does not adapt the scale, which results in a low rank on MoBe2016, but it is ranked high on UAV123, which is likely due to attribute cross-talk. The discrepancy in KCF is due to implementation – our KCF adapts scale. Notice that the UAV123 ranks on camera motion are equal to the scale variation ranks. We therefore compare both with ranks obtained by averaging fast in-plane motion ($E_m^f$) and large translation noise ($E_n^w$) performance on MoBe2016. The ranks match very well, which means that MoBe2016 offers a significant level of granularity in analysis.
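The cross-benchmark comparison amounts to ranking the shared trackers by a per-attribute score on each benchmark and checking how the two orderings agree. A minimal sketch with made-up scores (the numbers below are purely illustrative, not taken from either benchmark):

```python
from itertools import combinations

def rank_of(scores):
    """Map tracker -> rank (1 = best), ranking by score in descending order."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(order)}

def pairwise_agreement(scores_a, scores_b):
    """Fraction of tracker pairs ordered the same way by both benchmarks."""
    ra, rb = rank_of(scores_a), rank_of(scores_b)
    common = sorted(set(ra) & set(rb))
    pairs = list(combinations(common, 2))
    same = sum((ra[x] < ra[y]) == (rb[x] < rb[y]) for x, y in pairs)
    return same / len(pairs)

# Hypothetical per-attribute scores for trackers shared by the two benchmarks.
mobe = {"SRDCF": 0.51, "MEEM": 0.46, "KCF": 0.42, "DSST": 0.34, "IVT": 0.25}
uav = {"SRDCF": 0.47, "MEEM": 0.44, "KCF": 0.33, "DSST": 0.36, "IVT": 0.22}
print(pairwise_agreement(mobe, uav))
```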

6. Discussion and Conclusions

We have proposed a novel approach for single-target tracker evaluation on parameterized motion-related attributes. At the core of our approach is the use of 360 degree videos to generate annotated realistic-looking tracking scenarios. We have presented a novel benchmark, MoBe2016, composed of an annotated dataset of fifteen such videos and the results of 17 state-of-the-art trackers. The sequence generator, the 360 degree video annotation tools, the evaluation system and the benchmark will be made publicly available. The results of our experiment provide a detailed overview of the strengths and limitations of modern short-term visual trackers.

As an example, the scale change appears to be well addressed by many tracking approaches. Even trackers that do not adapt scale do not fail often. Nevertheless, in practice scale change is often accompanied by appearance change or fast motion, which increases the chances of failure. We believe that this is the reason why scale change is perceived as a challenging attribute in related benchmarks. The state-of-the-art trackers perform reasonably well in tracking small targets. Rotation and abrupt motion are two of the most challenging motion classes. Due to their scarcity in existing benchmarks they remain poorly addressed by most modern trackers. Our results have shown that non-random motions are well addressed by motion models, which have also become rare in modern trackers. We believe that future research in tracker development should focus on these topics to make further improvements.

Table 2: Overview of the EAO scores and their relative differences with respect to the baseline EAO score of a tracker. The first value in each cell is the absolute EAO score; the value in parentheses is the EAO difference relative to the baseline experiment. Green denotes a relative increase, orange a relative decrease, and red and bold red a decrease greater than 25% and 50% of the baseline score. The baseline experiment is not taken into account when computing the average tracker score.

Tracker    E_b    E_b^s           E_r^s           E_r^f           E_d             E_s^s           E_s^f           E_s^w           E_m^s           E_m^f           E_n^s           E_n^w           Average
Staple     0.633  0.718 (+0.085)  0.367 (-0.266)  0.329 (-0.304)  0.356 (-0.277)  0.627 (-0.006)  0.479 (-0.154)  0.563 (-0.070)  0.558 (-0.075)  0.055 (-0.578)  0.646 (+0.013)  0.315 (-0.318)  0.456 (-0.177)
MEEM       0.610  0.609 (-0.001)  0.375 (-0.235)  0.310 (-0.300)  0.367 (-0.243)  0.434 (-0.176)  0.380 (-0.230)  0.408 (-0.202)  0.535 (-0.075)  0.358 (-0.252)  0.664 (+0.054)  0.562 (-0.048)  0.455 (-0.155)
SiamFC     0.603  0.525 (-0.078)  0.346 (-0.257)  0.305 (-0.298)  0.334 (-0.269)  0.546 (-0.057)  0.433 (-0.170)  0.482 (-0.120)  0.487 (-0.116)  0.296 (-0.307)  0.538 (-0.065)  0.477 (-0.126)  0.433 (-0.169)
KCF        0.644  0.640 (-0.004)  0.339 (-0.305)  0.257 (-0.387)  0.306 (-0.338)  0.618 (-0.026)  0.524 (-0.120)  0.603 (-0.041)  0.459 (-0.185)  0.069 (-0.575)  0.640 (-0.005)  0.191 (-0.453)  0.422 (-0.222)
SRDCF      0.539  0.542 (+0.002)  0.320 (-0.219)  0.141 (-0.398)  0.214 (-0.326)  0.562 (+0.022)  0.420 (-0.119)  0.548 (+0.009)  0.467 (-0.073)  0.267 (-0.272)  0.610 (+0.071)  0.440 (-0.099)  0.412 (-0.128)
LGT        0.540  0.550 (+0.010)  0.467 (-0.074)  0.429 (-0.111)  0.432 (-0.109)  0.486 (-0.055)  0.455 (-0.085)  0.435 (-0.105)  0.448 (-0.092)  0.271 (-0.270)  0.456 (-0.085)  0.030 (-0.510)  0.405 (-0.135)
ASMS       0.498  0.491 (-0.007)  0.375 (-0.123)  0.390 (-0.107)  0.415 (-0.082)  0.458 (-0.040)  0.409 (-0.089)  0.410 (-0.088)  0.490 (-0.008)  0.206 (-0.292)  0.514 (+0.016)  0.178 (-0.320)  0.394 (-0.104)
FoT        0.442  0.430 (-0.011)  0.422 (-0.020)  0.297 (-0.145)  0.417 (-0.024)  0.437 (-0.005)  0.462 (+0.021)  0.437 (-0.004)  0.381 (-0.060)  0.228 (-0.213)  0.395 (-0.046)  0.342 (-0.100)  0.386 (-0.055)
Struck     0.562  0.390 (-0.172)  0.308 (-0.254)  0.160 (-0.403)  0.311 (-0.251)  0.433 (-0.129)  0.418 (-0.144)  0.414 (-0.148)  0.425 (-0.137)  0.167 (-0.395)  0.524 (-0.038)  0.456 (-0.106)  0.364 (-0.198)
DSST       0.662  0.595 (-0.066)  0.361 (-0.301)  0.278 (-0.384)  0.170 (-0.492)  0.465 (-0.197)  0.423 (-0.238)  0.468 (-0.193)  0.303 (-0.359)  0.038 (-0.624)  0.516 (-0.146)  0.093 (-0.568)  0.337 (-0.324)
MIL        0.513  0.402 (-0.110)  0.362 (-0.151)  0.249 (-0.264)  0.341 (-0.172)  0.452 (-0.060)  0.366 (-0.147)  0.378 (-0.135)  0.423 (-0.089)  0.006 (-0.506)  0.522 (+0.009)  0.061 (-0.452)  0.324 (-0.189)
ASLA       0.533  0.461 (-0.072)  0.324 (-0.209)  0.237 (-0.296)  0.271 (-0.262)  0.433 (-0.100)  0.385 (-0.148)  0.416 (-0.117)  0.336 (-0.196)  0.009 (-0.524)  0.514 (-0.019)  0.069 (-0.464)  0.314 (-0.219)
FragTrack  0.590  0.514 (-0.076)  0.274 (-0.316)  0.259 (-0.331)  0.052 (-0.538)  0.426 (-0.164)  0.491 (-0.099)  0.434 (-0.157)  0.123 (-0.467)  0.003 (-0.587)  0.539 (-0.052)  0.136 (-0.454)  0.296 (-0.295)
CT         0.422  0.245 (-0.177)  0.291 (-0.131)  0.166 (-0.256)  0.269 (-0.153)  0.331 (-0.091)  0.308 (-0.114)  0.326 (-0.096)  0.366 (-0.056)  0.008 (-0.413)  0.437 (+0.015)  0.086 (-0.335)  0.258 (-0.164)
IVT        0.458  0.427 (-0.032)  0.269 (-0.190)  0.083 (-0.376)  0.151 (-0.308)  0.442 (-0.016)  0.327 (-0.131)  0.419 (-0.039)  0.273 (-0.186)  0.004 (-0.455)  0.353 (-0.106)  0.030 (-0.428)  0.252 (-0.206)
L1APG      0.495  0.491 (-0.004)  0.230 (-0.265)  0.192 (-0.303)  0.135 (-0.360)  0.332 (-0.163)  0.264 (-0.231)  0.272 (-0.223)  0.289 (-0.206)  0.005 (-0.490)  0.421 (-0.074)  0.058 (-0.437)  0.244 (-0.251)
CMT        0.316  0.297 (-0.019)  0.178 (-0.139)  0.125 (-0.191)  0.148 (-0.168)  0.304 (-0.012)  0.280 (-0.036)  0.277 (-0.039)  0.194 (-0.122)  0.053 (-0.264)  0.327 (+0.011)  0.192 (-0.124)  0.216 (-0.100)
Average    0.533  0.490 (-0.043)  0.330 (-0.203)  0.247 (-0.286)  0.276 (-0.257)  0.458 (-0.075)  0.401 (-0.131)  0.429 (-0.104)  0.386 (-0.147)  0.120 (-0.413)  0.507 (-0.026)  0.219 (-0.314)

We have demonstrated the usefulness of the proposed approach for evaluating trackers in a controlled, yet realistic environment. The approach is complementary to existing benchmarks, allowing better insights into tracking behavior on various motion types. Moreover, capturing 360 degree videos is nowadays possible with commodity equipment. Therefore, adapting our dataset to a specific tracking scenario may in fact be easier than in traditional approaches, since it does not require careful planning before the acquisition to cover all possible motion types.

Our future work will focus on generalizing our framework to complex motion types and their effects on tracking performance. We also plan to explore adaptation of our evaluation methodology to active tracking.

References

[1] A. Adam, E. Rivlin, and I. Shimshoni. Robust fragments-based tracking using the integral histogram. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 798–805. IEEE Computer Society, June 2006.

[2] B. Babenko, M.-H. Yang, and S. Belongie. Robust object tracking with online multiple instance learning. IEEE Trans. Pattern Anal. Mach. Intell., 33(8):1619–1632, Aug. 2011.
[3] C. Bao, Y. Wu, H. Ling, and H. Ji. Real time robust L1 tracker using accelerated proximal gradient approach. In Comp. Vis. Patt. Recognition, pages 1830–1837, June 2012.
[4] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. S. Torr. Staple: Complementary learners for real-time tracking. In Comp. Vis. Patt. Recognition, pages 1401–1409, June 2016.
[5] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr. Fully-convolutional siamese networks for object tracking. arXiv preprint arXiv:1606.09549, 2016.
[6] L. Cehovin, A. Leonardis, and M. Kristan. Visual object tracking performance measures revisited. IEEE Trans. Image Proc., 25(3):1261–1274, 2016.
[7] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg. Accurate scale estimation for robust visual tracking. In Proc. British Machine Vision Conference, pages 1–11, 2014.
[8] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg. Learning spatially regularized correlation filters for visual tracking. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 4310–4318, 2015.
[9] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4340–4349, June 2016.
[10] S. Hare, A. Saffari, and P. H. S. Torr. Struck: Structured output tracking with kernels. In Int. Conf. Computer Vision, pages 263–270, Washington, DC, USA, 2011. IEEE Computer Society.
[11] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell., 37(3):583–596, 2014.
[12] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Cehovin, T. Vojir, G. Hager, A. Lukezic, G. Fernandez, et al. The visual object tracking VOT2016 challenge results. In Proc. European Conf. Computer Vision, 2016.
[13] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Cehovin, G. Fernandez, T. Vojir, G. Hager, G. Nebehay, R. Pflugfelder, et al. The visual object tracking VOT2015 challenge results. In Int. Conf. Computer Vision, 2015.
[14] M. Kristan, J. Matas, A. Leonardis, T. Vojir, R. Pflugfelder, G. Fernandez, G. Nebehay, F. Porikli, and L. Cehovin. A novel performance evaluation methodology for single-target trackers. IEEE Trans. Pattern Anal. Mach. Intell., 2016.
[15] M. Kristan, R. Pflugfelder, A. Leonardis, J. Matas, F. Porikli, L. Cehovin, G. Nebehay, G. Fernandez, T. Vojir, et al. The visual object tracking VOT2013 challenge results. In Vis. Obj. Track. Challenge VOT2013, in conjunction with ICCV2013, pages 98–111, Dec 2013.
[16] M. Kristan, R. Pflugfelder, A. Leonardis, J. Matas, L. Cehovin, G. Nebehay, T. Vojir, G. Fernandez, et al. The visual object tracking VOT2014 challenge results. In Proc. European Conf. Computer Vision, pages 191–217, 2014.
[17] P. Liang, E. Blasch, and H. Ling. Encoding color information for visual tracking: Algorithms and benchmark. IEEE Transactions on Image Processing, 24(12):5630–5644, Dec 2015.
[18] M. Mueller, N. Smith, and B. Ghanem. A benchmark and simulator for UAV tracking. In Proc. European Conf. Computer Vision, 2016.
[19] G. Nebehay and R. Pflugfelder. Clustering of static-adaptive correspondences for deformable object tracking. In Comp. Vis. Patt. Recognition, pages 2784–2791, June 2015.
[20] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang. Incremental learning for robust visual tracking. Int. J. Comput. Vision, 77(1-3):125–141, May 2008.
[21] A. Smeulders, D. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking: An experimental survey. IEEE Trans. Pattern Anal. Mach. Intell., 36(7):1442–1468, July 2014.
[22] L. Cehovin. TraX: Visual tracking exchange protocol. Technical report, Apr 2014.
[23] T. Vojir and J. Matas. The enhanced flock of trackers. In Registration and Recognition in Images and Videos, volume 532 of Studies in Computational Intelligence, January 2014.
[24] T. Vojir, J. Noskova, and J. Matas. Robust scale-adaptive mean-shift for tracking. In J.-K. Kamarainen and M. Koskela, editors, Image Analysis: 18th Scandinavian Conference, SCIA 2013, Espoo, Finland, June 17-20, 2013, Proceedings, pages 652–663, 2013.
[25] Y. Wu, J. Lim, and M.-H. Yang. Online object tracking: A benchmark. In Comp. Vis. Patt. Recognition, pages 2411–2418, 2013.
[26] Y. Wu, J. Lim, and M.-H. Yang. Object tracking benchmark. IEEE Trans. Pattern Anal. Mach. Intell., 37(9):1834–1848, Sept 2015.
[27] X. Jia, H. Lu, and M.-H. Yang. Visual tracking via adaptive structural local sparse appearance model. In Comp. Vis. Patt. Recognition, pages 1822–1829, 2012.
[28] D. P. Young and J. M. Ferryman. PETS metrics: On-line performance evaluation service. In ICCCN '05: Proceedings of the 14th International Conference on Computer Communications and Networks, pages 317–324, 2005.
[29] O. Zendel, M. Murschitz, M. Humenberger, and W. Herzner. CV-HAZOP: Introducing test data validation for computer vision. In Int. Conf. Computer Vision, December 2015.
[30] J. Zhang, S. Ma, and S. Sclaroff. MEEM: Robust tracking via multiple experts using entropy minimization. In Proc. European Conf. Computer Vision, pages 188–203, 2014.
[31] K. Zhang, L. Zhang, and M.-H. Yang. Real-time compressive tracking. In Proc. European Conf. Computer Vision, pages 864–877, 2012.
[32] L. Cehovin, M. Kristan, and A. Leonardis. Robust visual tracking using an adaptive coupled-layer visual model. IEEE Trans. Pattern Anal. Mach. Intell., 35(4):941–953, Apr. 2013.