
5826 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 24, NO. 12, DECEMBER 2015

Joint Sparse Representation and Robust Feature-Level Fusion for Multi-Cue Visual Tracking

Xiangyuan Lan, Student Member, IEEE, Andy J. Ma, Pong C. Yuen, Senior Member, IEEE, and Rama Chellappa, Fellow, IEEE

Abstract— Visual tracking using multiple features has been proved to be a robust approach because features can complement each other. Since different types of variations such as illumination, occlusion, and pose may occur in a video sequence, especially in long sequences, how to properly select and fuse appropriate features has become one of the key problems in this approach. To address this issue, this paper proposes a new joint sparse representation model for robust feature-level fusion. The proposed method dynamically removes unreliable features from the fusion for tracking by exploiting the advantages of sparse representation. In order to capture the non-linear similarity of features, we extend the proposed method into a general kernelized framework, which is able to perform feature fusion in various kernel spaces. As a result, robust tracking performance is obtained. Both the qualitative and quantitative experimental results on publicly available videos show that the proposed method outperforms both sparse representation-based and fusion-based trackers.

Index Terms— Visual tracking, feature fusion, joint sparse representation.

I. INTRODUCTION

VISUAL tracking is an important and active research topic in the computer vision community because of its wide range of applications, e.g., intelligent video surveillance, human-computer interaction and robotics. Although it has been extensively studied in the last two decades, it still remains a challenging problem due to the many appearance variations caused by occlusion, pose, illumination and so on. Therefore, effective modeling of the object's appearance is one of the key issues for the success of a visual tracker [1], and many visual features have been proposed to account for the different variations [2]–[5]. However, since the appearance of a target and the environment change dynamically, especially in long-term videos, a single feature is insufficient to deal with all such variations.

Manuscript received March 2, 2015; revised July 26, 2015; accepted August 27, 2015. Date of publication September 23, 2015; date of current version October 28, 2015. This work was supported by the General Research Fund through the Research Grants Council, Hong Kong, under Grant HKBU 212313. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Ling Shao.

X. Lan and P. C. Yuen are with the Department of Computer Science, Hong Kong Baptist University, Hong Kong (e-mail: [email protected]; [email protected]).

A. J. Ma was with the Department of Computer Science, Hong Kong Baptist University, Hong Kong. He is now with the Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218 USA (e-mail: [email protected]).

R. Chellappa is with the Center for Automation Research, Department of Electrical and Computer Engineering, University of Maryland Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2015.2481325

As such, the use of multiple cues/features to model the object appearance has been proved to be a robust approach for improved performance [6]–[11]. Along this line, many algorithms have been proposed for visual tracking in the past years. Generally speaking, existing multi-cue tracking algorithms can be roughly divided into two categories: score-level and feature-level. Score-level approaches combine the classification scores corresponding to different visual features to perform foreground/background classification. Methods such as online boosting [7], [12], multiple kernel boosting [6] and online multiple instance boosting [8] belong to this category. However, the Data Processing Inequality (DPI) [13] indicates that the feature level contains more information than the classifier level. Therefore, feature-level fusion should be performed to take advantage of more informative cues for tracking. A typical feature-level fusion method is to concatenate different feature vectors to form a single vector [9]. But such a method may result in a high-dimensional feature vector which may degrade tracking efficiency. In addition, ignoring the incompatibility of heterogeneous features may cause performance degradation. Another feature-level fusion method is multiple kernel learning (MKL) [14], which aims to learn a weighted combination of different feature kernels according to their discriminative power. Since some features extracted from a target may be corrupted due to unexpected variations, the performance of a pre-learned MKL classifier may be affected by such features. Moreover, because not all cues/features are reliable, combining all the features may not improve the tracking performance. As such, dynamic selection/combination of visual cues/features is required for multi-cue tracking.

Recently, multi-task joint sparse representation (MTJSR) [15] has been proposed for feature-level fusion in object classification, and promising results have been reported. In MTJSR, the class-level joint sparsity patterns among multiple features are discovered by using a joint sparsity-inducing norm. Therefore, the relationship among different visual cues can be discovered through the joint sparsity constraint. Moreover, high-dimensional features are represented by low-dimensional reconstruction weights for efficient fusion. However, MTJSR was derived based on the assumption that all representation tasks are closely related and share the same sparsity pattern, which may not be valid in tracking applications, since unreliable features may exist due to appearance and other variations of a tracked target. For example, when a tracked object is partially occluded, its local features extracted from the occluded part(s) may not be relevant to or well represented by the corresponding feature dictionary. In such cases, the reconstruction coefficients of such features may not demonstrate the sparsity pattern, which is an intrinsic characteristic of sparse representation as indicated in [16] and [17]. In addition, since unreliable features are not relevant to their corresponding feature dictionaries, the reconstruction error may be large. Fusing such features with large reconstruction errors may not describe the object appearance well, which may degrade the tracking performance. In [15], a weighting scheme based on LPBoost [18] is adopted to measure the confidence of different features on a validation set for fusion. But in visual tracking, the variation of the tracked target cannot be predicted, and it is not reasonable to measure the confidence of different features on an offline validation set.

In order to address the above mentioned issues in feature-level fusion based trackers, we propose a new tracking algorithm called Robust Joint Sparse Representation-based Feature-level Fusion Tracker (RJSRFFT).1 We integrate feature selection, fusion and representation into a unified optimization model. Within this model, unreliable features are detected and feature-level fusion is simultaneously performed on the selected reliable features. Therefore, the negative effect of unreliable features can be removed, and the relationship of reliable features can be exploited to select representative templates for a sparse representation of the tracked target, which preserves the advantages of sparse trackers. Based on the proposed model, we also develop a general tracking algorithm called Robust Joint Kernel Sparse Representation based Feature-level Fusion Tracker (K-RJSRFFT) to dynamically perform fusion of different features from different kernel spaces, which is able to utilize the non-linearity in features to enhance the tracking performance.

The contributions of this paper are as follows:

• We develop a new visual tracking algorithm based on feature-level fusion using joint sparse representation. The proposed method inherits all the advantages of joint sparse representation and is able to fuse multiple features for object tracking.

• We present a method to detect unreliable visual features for feature-level fusion. By removing unreliable features (outliers), which introduce negative effects into the fusion, the tracking performance can be improved.

• In order to utilize the non-linearity in data samples to enhance the performance of the proposed tracking algorithm, we extend the algorithm into a general framework which is able to dynamically combine multiple kernels with multiple features.

The main differences between this journal version and our preliminary conference paper [19] are as follows. First, the tracker reported in the conference version [19] is linear in the data samples, while this journal version considers a non-linear extension and is able to dynamically perform feature-level fusion in multiple kernel spaces. Second, the optimization procedure used in this journal version (the proposed K-RJSRFFT method) is derived based on kernel functions without knowing the exact mapping. Therefore, the proposed tracker K-RJSRFFT is able to operate in any implicit feature space once the kernel matrix is given. Third, to increase the computational efficiency of the accelerated proximal gradient method, this journal version introduces a new method to adaptively and tightly estimate the Lipschitz constant. This is different from the conference version [19], which used a predefined value. Finally, comprehensive analysis and experimental evaluation of the proposed trackers against state-of-the-art trackers are presented in this journal version, which further demonstrates the effectiveness of the proposed robust feature-level fusion scheme.

1 Preliminary results of this work were presented in [19].

The rest of this paper is organized as follows. In Section II, we first review some related work on tracking from two approaches: generative and discriminative. We also give an analysis of sparse representation-based trackers as well as review joint sparse representation-based methods in computer vision. In Section III, we present the proposed feature-level fusion tracking algorithm. In Section IV, we discuss some implementation details of the proposed algorithm. Experimental results and conclusions are presented in Sections V and VI, respectively.

II. RELATED WORK

In this section, we first review some representative tracking algorithms and discuss sparse representation-based trackers. We also review existing joint sparse representation methods related to our proposed tracking algorithm. For a more comprehensive study of tracking methods, interested readers can refer to [1] and [20]–[24].

A. Generative Trackers

Generative trackers construct an appearance model to describe the target and then search for an image region that best matches the object appearance model. In the last decade, much work has been done on object appearance modeling, including mixture models [25], [26], kernel-based methods [27]–[29], subspace learning [3], [30]–[33] and linear representations [34]–[40]. In [25], a WSL mixture model was constructed via the expectation maximization algorithm to handle appearance variations and occlusion during tracking. In [27], a mean-shift procedure based on a color histogram representation was introduced to locate the object. Based on the work in [27], some variants of kernel-based trackers were proposed, e.g., [28], [29]. Since most kernel-based trackers do not have occlusion handling steps, the fragment-based tracker [41] was proposed to handle occlusions using a patch-based appearance model. Based on the assumption that object appearances under mild pose variations lie in a low-dimensional subspace, Black and Jepson [30] employed subspace learning to construct an object appearance model off-line within the optical flow framework. Instead of using a pre-learned appearance model, which may not account well for large appearance changes, incremental subspace learning methods, e.g., [3], [31], [32], attempted to learn an adaptive, low-dimensional subspace for appearance modeling. Kwon and Lee [33] exploited sparse PCA to construct multiple observation models to enhance the robustness to combinatorial appearance changes. Different from subspace learning approaches, which require learning subspace bases, sparse linear representation-based methods [34] directly use a sparse linear combination of object templates plus trivial templates to account for target appearance variations while modeling occlusion and image noise. Li et al. [35] proposed non-sparse metric-weighted least square regression for efficient appearance modeling; such an adaptive metric can capture the correlation between feature dimensions, leading to the utilization of more information for object appearance modeling.

B. Discriminative Trackers

Discriminative trackers cast tracking as a foreground/background classification problem and locate the object's position by maximizing the interclass separation. Recent efforts on discriminative trackers include SVM-based tracking [42], [43], online boosting [6], [7], [12], multiple instance learning [8], [44], compressive tracking [4], and metric learning trackers [45]–[47]. Avidan [42] employed an SVM based on optical flow for object tracking. However, since this tracking algorithm learned the classifier with off-line samples, it may not be able to handle unpredicted appearance variations. Different from a static appearance model, an adaptive model can handle appearance variation adaptively. In [7], an online boosting algorithm was developed for feature selection in visual tracking. Since this method regards the current tracker position as a positive sample for model updating, the tracker may drift owing to updating the model with potentially misaligned samples. In [12], a semi-supervised learning algorithm was proposed for tracking in which positive samples are collected only at the first frame to handle the ambiguity of sample labeling. Babenko et al. [8] developed an online tracking algorithm within the multiple instance learning framework in which samples are collected in the form of bags. With the goal of enhancing resistance to background distracters, Jiang et al. [45] exploited neighborhood component analysis metric learning to learn an adaptive metric for differential tracking. Different from classical subspace learning-based feature extraction methods in visual tracking [31], [48], which are data-dependent and may be sensitive to misaligned samples, Zhang et al. [4] exploited a data-independent sparse random measurement matrix to extract low-dimensional features for Naive Bayes classifier-based tracking. Recently, Hare et al. [49] proposed an online structured output SVM-based tracker to incorporate structured constraints into the classifier training stage. Henriques et al. [50] developed a fast Fourier transform-based tracker which exploits the circulant structure of the kernel in an SVM. In [51], a superpixel-based discriminative appearance model was introduced to handle occlusion and deformation.

C. Sparse Representation Based Trackers

Based on the intuition that the appearance of a tracked object can be sparsely represented by its appearance in previous frames, a sparse representation-based tracker was introduced in [17], which is robust to occlusion and noise. Subsequently, many algorithms have been proposed to improve tracking accuracy with reduced computational complexity [22]. Li et al. [52] exploited compressive sensing theory to reduce the template dimension and improve the computational efficiency. Liu et al. [53] developed a two-stage sparse optimization-based tracker in which sparse discriminative features are selected and temporal information is exploited for target representation via a dynamic group sparsity algorithm. Liu et al. [54] proposed a mean-shift tracker based on histograms of local sparse codes. Mei et al. [34] adopted ℓ2 minimization to bound the ℓ1 error in order to speed up particle resampling, and Bao et al. [55] utilized the accelerated proximal gradient method and introduced an ℓ2-norm regularization on the trivial template coefficients to speed up and improve the performance of the ℓ1 tracker. Zhong et al. [56] combined a holistic sparsity-based discriminative classifier and a local sparse generative model to handle occlusion, cluttered background and model drift. Zhang et al. [57] proposed a multi-task joint sparse learning method to exploit the relationship between particles such that the accuracy of the ℓ1 tracker can be improved. Jia et al. [58] developed a local sparse appearance model to enhance robustness to occlusion. Wang et al. [59] introduced sparsity regularization into incremental subspace learning to account for the noise caused by occlusion. Zhuang et al. [60] constructed a discriminative similarity map for reverse sparse-based tracking. In order to exploit a visual prior for tracking, Wang et al. [61] constructed a codebook from SIFT descriptors learned from a general image data set for sparse coding, and a linear classifier is trained on the sparse codes for foreground/background classification. The trackers mentioned above utilize a single cue/feature for appearance modeling. To fuse multiple features, Wu et al. [9] concatenated multiple features into a high-dimensional feature vector to construct a template set for sparse representation. However, the high dimensionality of the combined feature vector increases the computational complexity of this method, and fusion via concatenation may not improve performance when some source data are corrupted. Wang et al. [62] introduced a kernel sparse representation-based tracking algorithm to fuse multiple features in kernel space for tracking, but this method assumes that all features are reliable and cannot adaptively select reliable features for tracking. Recently, Wu et al. [63] proposed a metric learning-based structural appearance model in which weighted local sparse codes are fused for appearance modeling by concatenation.

D. Multi-Task Joint Sparse Representation

Multi-task learning aims to improve the overall performance of related tasks by exploiting the cross-task relationships. Yuan and Yan [15] formulated linear representation models from multiple visual features as a multi-task joint sparse representation problem in which multiple features are fused via class-level joint sparsity regularization. Zhang et al. [64] proposed a novel joint dynamic sparsity prior for multi-observation visual recognition. Shekhar et al. [65] proposed a novel multimodal multivariate sparse representation method for multimodal biometrics recognition. Recently, Bahrampour et al. [66] incorporated a tree-structured sparsity-inducing norm and a weighting scheme for multi-modality classification.

III. ROBUST FEATURE-LEVEL FUSION FOR MULTI-CUE TRACKING

This section presents the details of the proposed feature-level fusion tracking algorithm. The proposed method is developed based on the particle filter framework, and consists of two major components: feature-level fusion based on joint sparse representation, and detecting unreliable visual cues for robust fusion.

A. Particle Filter

Our tracking algorithm is developed under the framework of sequential Bayesian inference. Let $l_t$ and $z_t$ denote the latent state and the observation at time $t$, respectively. Given a set of observations $Z_t = \{z_t, t = 1, \ldots, T\}$ up to frame $T$, the true posterior $P(l_t \mid Z_t)$ is approximated by a particle filter with a set of particles $l_t^i$, $i = 1, \ldots, n$. The latent state variable $l_t$, which describes the motion state of the target, is estimated using:

$$\tilde{l}_t = \arg\max_{l_t^i} P(l_t^i \mid Z_t) \tag{1}$$

The tracking problem is thus formulated as recursively estimating the posterior probability $P(l_t \mid Z_t)$,

$$P(l_t \mid Z_t) \propto P(z_t \mid l_t) \int P(l_t \mid l_{t-1})\, P(l_{t-1} \mid Z_{t-1})\, dl_{t-1} \tag{2}$$

where $P(l_t \mid l_{t-1})$ denotes the motion model and $P(z_t \mid l_t)$ the observation model. We use the same motion model as in [5], and define the observation model using the proposed tracking algorithm, which is described in the following subsections.
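For illustration, the following minimal sketch shows one step of such a particle filter in NumPy. It assumes a Gaussian random-walk motion model and an abstract `likelihood_fn` standing in for the observation model defined later in (15) and (27); the function and parameter names are illustrative and do not correspond to our MATLAB implementation.

```python
import numpy as np

def particle_filter_step(particles, weights, likelihood_fn, motion_std=4.0, rng=None):
    """One bootstrap particle-filter step approximating Eqs. (1)-(2).

    particles     : (n, d) array of state hypotheses l_t^i
    weights       : (n,) importance weights approximating P(l_{t-1} | Z_{t-1})
    likelihood_fn : callable returning P(z_t | l_t) for a single state
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = particles.shape

    # Resample from the previous posterior, then propagate with the motion model
    # P(l_t | l_{t-1}); here a Gaussian random walk stands in for the model of [5].
    idx = rng.choice(n, size=n, p=weights)
    particles = particles[idx] + rng.normal(scale=motion_std, size=(n, d))

    # Re-weight by the observation model P(z_t | l_t), cf. Eq. (2).
    weights = np.array([likelihood_fn(p) for p in particles])
    weights = weights / weights.sum()

    # MAP estimate over the particle set, cf. Eq. (1).
    return particles, weights, particles[np.argmax(weights)]
```

In the proposed trackers, `likelihood_fn` would evaluate (15) or (27) on the image patch cropped at the candidate state.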

B. Feature-Level Fusion Based on Joint Sparse Representation

In the particle filter-based multi-cue tracking framework, we are given $K$ types of visual cues, e.g., color, shape and texture, to represent the tracking result in the current frame and the template images of the target object. Denote the $k$-th visual cue of the current tracking result and of the $n$-th template image as $y^k$ and $x_n^k$, respectively. Inspired by the sparse tracking algorithm [17], the tracking result in the current frame can be sparsely represented by a linear combination of the target templates plus an error vector $\epsilon^k$ for each visual cue, i.e.,

$$y^k = X^k w^k + \epsilon^k, \quad k = 1, \cdots, K \tag{3}$$

where $w^k$ is a weight vector of dimension $N$ that reconstructs the current tracking result with visual cue $y^k$ based on the template set $X^k = [x_1^k, \ldots, x_N^k]$, and $N$ is the number of templates.

In (3), the weight vectors $w^1, \cdots, w^K$ can be considered as an underlying representation of the tracking result in the current frame with visual cues $y^1, \cdots, y^K$. In other words, feature-level fusion is realized by discovering the relationship among the visual cues $y^1, \cdots, y^K$ to determine the weight vectors $w^1, \cdots, w^K$ dynamically. To learn the optimal fused representation, we define the objective function by minimizing the reconstruction error plus a regularization term, i.e.,

$$\min_{W} \ \frac{1}{2}\sum_{k=1}^{K} \|y^k - X^k w^k\|_2^2 + \lambda\,\Omega(W) \tag{4}$$

where $\|\cdot\|_2$ represents the $\ell_2$ norm, $\lambda$ is a non-negative parameter, $W = (w^1, \ldots, w^K) \in \mathbb{R}^{N \times K}$ is the matrix of the weight vectors and $\Omega$ is the regularization function on $W$.

To derive the regularization function $\Omega$, we assume that the current tracking result can be sparsely represented by the same set of chosen target templates with indices $n_1, \cdots, n_c$ for each visual cue, i.e.,

$$y^k = w_{n_1}^k x_{n_1}^k + \cdots + w_{n_c}^k x_{n_c}^k + \epsilon^k, \quad k = 1, \cdots, K \tag{5}$$

Under the joint sparsity assumption, the number of chosen target templates $c = \|(\|w_1\|_2, \cdots, \|w_N\|_2)\|_0$ is a small number. Therefore, we can minimize this sparsity measure as the regularization term in optimization problem (4). Since the $\ell_0$ norm can be approximated by the $\ell_1$ norm to make the optimization problem tractable, we define $\Omega$ as the following equation, similar to that in [15] measuring the class-level sparsity for classification applications,

$$\Omega(W) = \|(\|w_1\|_2, \cdots, \|w_N\|_2)\|_1 = \sum_{n=1}^{N} \|w_n\|_2 \tag{6}$$

where $w_n$ denotes the $n$-th row of the matrix $W$, corresponding to the weights of the visual cues for the $n$-th target template. With this formulation, the joint sparsity across different visual cues can be discovered, i.e., $w_n$ becomes zero for a large number of target templates when minimizing optimization problem (4). This ensures that the selected templates (with non-zero weights) play more important roles in reconstructing the current tracking result for all the visual cues.
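For concreteness, the objective in (4) with the regularizer (6) can be evaluated as in the following minimal NumPy sketch; it only illustrates the cost being minimized and is not the solver used in our experiments (the minimization itself is handled by the APG procedure of Section III-D).

```python
import numpy as np

def joint_sparse_objective(Y, X, W, lam):
    """Objective value of Eq. (4) with the regularizer Omega of Eq. (6).

    Y   : list of K cue vectors y^k, each of shape (d_k,)
    X   : list of K template matrices X^k, each of shape (d_k, N)
    W   : (N, K) weight matrix whose k-th column is w^k
    lam : non-negative regularization parameter lambda
    """
    K = len(Y)
    # Data term: 0.5 * sum_k ||y^k - X^k w^k||_2^2
    recon = 0.5 * sum(np.sum((Y[k] - X[k] @ W[:, k]) ** 2) for k in range(K))
    # Omega(W) = sum_n ||w_n||_2, the L2 norm of each row (one row per template),
    # which encourages all cues to share the same small set of templates.
    omega = np.sum(np.linalg.norm(W, axis=1))
    return recon + lam * omega
```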

C. Detecting Unreliable Visual Cues for Robust Fusion

Since some visual cues may be sensitive to certain variations, the assumption of shared sparsity may not be valid for tracking. Such unreliable visual cues of the target cannot be sparsely represented by the same set of selected target templates. That means that, for an unreliable visual cue $y^{k'}$, all the target templates are likely to have non-zero weights in order to achieve a small reconstruction error, i.e.,

$$y^{k'} = w_1^{k'} x_1^{k'} + \cdots + w_N^{k'} x_N^{k'} + \epsilon^{k'} \tag{7}$$

where $w_1^{k'}, \ldots, w_N^{k'}$ are non-zero weights. In this case, we cannot obtain a robust fusion result by minimizing optimization problem (4) with the regularization function (6).

Although unreliable features cannot satisfy (5), reliable features can still be sparsely represented by (5) and used to choose the most informative target templates for reconstruction. With the selected templates of indices $n_1, \cdots, n_c$, we rewrite (7) as follows,

$$y^{k'} - \sum_{i=1}^{c} w_{n_i}^{k'} x_{n_i}^{k'} = \sum_{j=1}^{N-c} w_{m_j}^{k'} x_{m_j}^{k'} + \epsilon^{k'} \tag{8}$$

where $m_j$ denotes the index of a template which is not chosen to reconstruct the current tracking result.

Suppose we have $K'$ unreliable visual cues. Without loss of generality, let visual cues $1, \cdots, K - K'$ be reliable, while $K - K' + 1, \cdots, K$ are unreliable. To detect the $K'$ unreliable visual cues, we employ the sparsity assumption for the unreliable features, i.e., the number of unreliable visual cues $K' = \big\|\big(\sum_{j=1}^{N-c} |w_{m_j}^{1}|^2, \cdots, \sum_{j=1}^{N-c} |w_{m_j}^{K}|^2\big)\big\|_0$ is a small number, which can be used to define the regularization function. Similar to (6), the $\ell_1$ norm is used instead of the $\ell_0$ norm. Combining this with the regularization function for discovering the joint sparsity among reliable features, the regularization function $\Omega$ in (4) becomes

$$\Omega(W) = \theta_1 \sum_{n=1}^{N} \Big(\sum_{k=1}^{K-K'} |w_n^k|^2\Big)^{\frac{1}{2}} + \theta_2 \sum_{k=1}^{K} \Big(\sum_{j=1}^{N-c} |w_{m_j}^k|^2\Big)^{\frac{1}{2}} \tag{9}$$

where $\theta_1$ and $\theta_2$ are non-negative parameters that balance the joint sparsity across the selected target templates and the unreliable visual cues.

However, we have no information about the selected templates and the unreliable features before learning, so we cannot define the regularization function as in (9) a priori. Inspired by robust multi-task feature learning [67], the weight matrix $W$ can be decomposed into two terms $R$ and $S$ with $W = R + S$. Suppose the non-zero weights of the reliable features are encoded in $R$, while the non-zero weights of the unreliable features are encoded in $S$. The current tracking result of a reliable visual cue $k$ can then be reconstructed by the information in $R$ only, i.e., (5) is revised as

$$y^k = r_{n_1}^k x_{n_1}^k + \cdots + r_{n_c}^k x_{n_c}^k + \epsilon^k, \quad k = 1, \cdots, K - K' \tag{10}$$

On the other hand, (8) for an unreliable feature $k'$ is changed to

$$y^{k'} - \sum_{i=1}^{c} s_{n_i}^{k'} x_{n_i}^{k'} = \sum_{j=1}^{N-c} s_{m_j}^{k'} x_{m_j}^{k'} + \epsilon^{k'}, \quad k' = K - K' + 1, \cdots, K \tag{11}$$

Based on the above analysis, the final regularization function can be defined analogously to (9), i.e.,

$$\Omega(W) = \theta_1 \sum_{n=1}^{N} \|r_n\|_2 + \theta_2 \sum_{k=1}^{K} \|s^k\|_2 \tag{12}$$

Denote $\lambda_1 = \lambda\theta_1$ and $\lambda_2 = \lambda\theta_2$. Substituting $\Omega(W)$ from (12) into optimization problem (4), the proposed robust joint sparse representation based feature-level fusion tracker (RJSRFFT) is formulated as

$$\min_{W,R,S} \ \frac{1}{2}\sum_{k=1}^{K} \|y^k - X^k w^k\|_2^2 + \lambda_1 \sum_{n=1}^{N} \|r_n\|_2 + \lambda_2 \sum_{k=1}^{K} \|s^k\|_2 \quad \text{s.t.} \ \ W = R + S \tag{13}$$

The procedure for solving the optimization problem (13) will be given in the following subsection. The optimal fused representation is given by $R$ and $S$, which encode the sparsity patterns of the reliable features and the non-sparsity patterns of the unreliable visual cues, respectively. Therefore, when the non-sparsity pattern of an unreliable feature is encoded in $S$, the corresponding column of coefficients in $S$ will be larger than zero. We thus determine the index set $O$ of the unreliable features as

$$O = \{k' \ \text{s.t.} \ \|s^{k'}\|_2 \ge T\} \tag{14}$$

This scheme declares a visual cue unreliable when the norm of the corresponding column of matrix $S$ is larger than a pre-defined threshold $T$. That is to say, the features which produce a strong response in the corresponding columns of $S$ are determined to be unreliable. Since we have no information about the selected templates and the unreliable features before learning the reconstruction coefficients, the matrix $S$ may also encode some small components of the reconstruction coefficients of reliable features (as can be seen in Fig. 1(d), where matrix $S$ shows some small non-zero components for reliable features). Therefore, if the threshold is too small, reliable features which have small components in $S$ may be mistaken for unreliable features. On the other hand, if it is too large, unreliable features may not be successfully detected. In our experiments, we empirically set it to 0.0007.

Fig. 1. Graphical illustration of unreliable feature detection on synthetic data. (a) Original weight matrix. (b) Weight matrix by MTJSR [15]. (c) Matrix R by RJSRFFT. (d) Matrix S by RJSRFFT.

The observation likelihood function is then defined by $R$ and $S$ as follows. The representation coefficients of the different visual cues are estimated and the unreliable features are detected by solving the optimization problem (13). Then, the observation likelihood function in (2) is defined as

$$p(z_t \mid l_t) \propto \exp\Big(-\frac{1}{K - K'}\sum_{j \notin O} \|y^j - X^j r^j\|_2^2\Big) \tag{15}$$

where the right-hand side of this equation corresponds to the average reconstruction error of the reliable visual cues. Since the proposed model can detect the unreliable cues, the likelihood function combines the reconstruction errors of the reliable cues to define the final similarity between the target candidate and the target templates.
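Given a solution (R, S) of (13), the detection rule (14) and the likelihood (15) are straightforward to evaluate, as the following sketch illustrates. It assumes R and S are stored as N × K NumPy arrays and that the cue features and template sets are available as arrays; the default threshold mirrors the value 0.0007 used in our experiments.

```python
import numpy as np

def detect_unreliable_cues(S, T=7e-4):
    """Index set O of Eq. (14): cue k is unreliable when ||s^k||_2 >= T,
    where s^k is the k-th column of S."""
    col_norms = np.linalg.norm(S, axis=0)
    return set(np.flatnonzero(col_norms >= T).tolist())

def observation_likelihood(Y, X, R, O):
    """Likelihood of Eq. (15): average reconstruction error over the reliable cues."""
    K = len(Y)
    reliable = [k for k in range(K) if k not in O]
    errors = [np.sum((Y[k] - X[k] @ R[:, k]) ** 2) for k in reliable]
    return float(np.exp(-np.mean(errors)))
```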

D. Optimization Procedure

The objective function in (13) consists of a smooth function and a non-smooth function. This kind of optimization problem can be solved efficiently by employing the Accelerated Proximal Gradient (APG) method. By using first-order information, the APG method obtains the globally optimal solution with convergence rate $O(\frac{1}{t^2})$, as shown in [68]. It has also been further studied in [69] and [70] and applied to multi-task sparse learning/representation problems [67], [71], [72]. We apply the APG method in a manner similar to that in [71] and derive the following algorithm. Let

$$F(R, S) = \frac{1}{2}\sum_{k=1}^{K} f(r^k, s^k) = \frac{1}{2}\sum_{k=1}^{K} \Big\|y^k - \sum_{n=1}^{N} x_n^k (r_n^k + s_n^k)\Big\|_2^2$$
$$G(R, S) = \lambda_1 \sum_{n=1}^{N} \|r_n\|_2 + \lambda_2 \sum_{k=1}^{K} \|s^k\|_2 \tag{16}$$

where $F$ is a differentiable convex function with Lipschitz-continuous gradient, while $G$ is convex but non-differentiable. Then, by taking the first-order Taylor expansion of $F$ together with a quadratic regularization term, we construct the objective function with the quadratic upper approximation of $F$ and the non-differentiable function $G$ at the aggregation matrices $U^t$ and $V^t$ as follows,

$$\Phi(R, S) = \frac{1}{2}\sum_{k=1}^{K} \Big\{ f(u^{k,t}, v^{k,t}) + (\nabla_u^{k,t})^T (r^k - u^{k,t}) + (\nabla_v^{k,t})^T (s^k - v^{k,t}) + \frac{\mu}{2}\|r^k - u^{k,t}\|_2^2 + \frac{\mu}{2}\|s^k - v^{k,t}\|_2^2 \Big\} + \lambda_1 \sum_{n=1}^{N} \|r_n\|_2 + \lambda_2 \sum_{k=1}^{K} \|s^k\|_2 \tag{17}$$

where $\mu$ is the Lipschitz constant [69], and $\nabla_u^{k,t}$ and $\nabla_v^{k,t}$ are the gradients of $F(U, V)$ with respect to $u^{k,t}$ and $v^{k,t}$, respectively. In the $(t+1)$-th iteration, given the aggregation matrices $U^t$ and $V^t$, the proximal matrices $R^{t+1}$ and $S^{t+1}$ are obtained by minimizing the following optimization problem,

$$\arg\min_{R,S} \ \Phi(R, S) \tag{18}$$

With some algebraic manipulations, (18) can be separated into two independent sub-problems in $R$ and $S$, respectively, i.e.,

$$\min_{R} \ \frac{1}{2}\sum_{k=1}^{K} \Big\|r^k - \Big(u^{k,t} - \frac{1}{\mu}\nabla_u^{k,t}\Big)\Big\|_2^2 + \frac{\lambda_1}{\mu}\sum_{n=1}^{N} \|r_n\|_2$$
$$\min_{S} \ \frac{1}{2}\sum_{k=1}^{K} \Big\|s^k - \Big(v^{k,t} - \frac{1}{\mu}\nabla_v^{k,t}\Big)\Big\|_2^2 + \frac{\lambda_2}{\mu}\sum_{k=1}^{K} \|s^k\|_2 \tag{19}$$

where the gradients are given by $\nabla_u^{k,t} = -(X^k)^T y^k + (X^k)^T X^k u^{k,t} + (X^k)^T X^k v^{k,t}$ and $\nabla_v^{k,t} = -(X^k)^T y^k + (X^k)^T X^k v^{k,t} + (X^k)^T X^k u^{k,t}$. As in [71], we solve the above sub-problems using the following two steps iteratively:

1) Gradient Mapping Step: the proximal matrices $R^{t+1}$ and $S^{t+1}$ are updated by (20) and (21), respectively.

$$r^{k,t+\frac{1}{2}} = u^{k,t} - \frac{1}{\mu}\nabla_u^{k,t}, \quad k = 1, \cdots, K,$$
$$r_n^{t+1} = \max\Big(0,\ 1 - \frac{\lambda_1}{\mu\,\|r_n^{t+\frac{1}{2}}\|_2}\Big) \cdot r_n^{t+\frac{1}{2}}, \quad n = 1, \cdots, N \tag{20}$$

$$s^{k,t+\frac{1}{2}} = v^{k,t} - \frac{1}{\mu}\nabla_v^{k,t}, \quad k = 1, \cdots, K,$$
$$s^{k,t+1} = \max\Big(0,\ 1 - \frac{\lambda_2}{\mu\,\|s^{k,t+\frac{1}{2}}\|_2}\Big) \cdot s^{k,t+\frac{1}{2}}, \quad k = 1, \cdots, K \tag{21}$$

It should be noted that the update scheme (20) for $R$ and the update scheme (21) for $S$ differ from each other, since $R$ and $S$ have different sparsity properties, with groups formed by rows and by columns, respectively.

2) Aggregation Step: the aggregation matrices are updated as follows.

$$U^{t+1} = R^{t+1} + \frac{a_t - 1}{a_{t+1}}\,(R^{t+1} - R^{t}),$$
$$V^{t+1} = S^{t+1} + \frac{a_t - 1}{a_{t+1}}\,(S^{t+1} - S^{t}) \tag{22}$$

where $a_{t+1} = \frac{1 + \sqrt{1 + 4a_t^2}}{2}$ and $a_0 = 1$.

The overall optimization procedure is summarized in Algorithm 1.

Algorithm 1: Optimization Procedure for Problem (13)
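The two steps above amount to a gradient step, group soft-thresholding (over the rows of R and the columns of S), and a Nesterov-style aggregation. The following NumPy sketch re-implements these updates for dense inputs under the stated equations; it is illustrative only (it runs a fixed number of iterations instead of testing convergence, and assumes μ has already been chosen, e.g., by the estimate of Section IV-B), and is not our MATLAB implementation.

```python
import numpy as np

def group_shrink_rows(M, tau):
    """Row-wise group soft-thresholding: r_n <- max(0, 1 - tau/||r_n||_2) * r_n."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return scale * M

def apg_solver(Y, X, lam1, lam2, mu, n_iter=100):
    """Minimal APG sketch for problem (13); returns (W, R, S) with W = R + S.

    Y : list of K cue vectors y^k; X : list of K template matrices X^k (d_k x N).
    """
    K, N = len(Y), X[0].shape[1]
    R_prev = np.zeros((N, K)); S_prev = np.zeros((N, K))
    U, V = R_prev.copy(), S_prev.copy()
    a = 1.0
    G = [X[k].T @ X[k] for k in range(K)]   # Gram matrices (X^k)^T X^k
    c = [X[k].T @ Y[k] for k in range(K)]   # correlations (X^k)^T y^k
    for _ in range(n_iter):
        # Gradient of F at (U, V): the same expression for the u- and v-blocks.
        grad = np.stack([G[k] @ (U[:, k] + V[:, k]) - c[k] for k in range(K)], axis=1)
        # Gradient mapping step, Eqs. (20)-(21): row groups for R, column groups for S.
        R_new = group_shrink_rows(U - grad / mu, lam1 / mu)
        S_new = group_shrink_rows((V - grad / mu).T, lam2 / mu).T
        # Aggregation step, Eq. (22).
        a_new = (1.0 + np.sqrt(1.0 + 4.0 * a * a)) / 2.0
        U = R_new + (a - 1.0) / a_new * (R_new - R_prev)
        V = S_new + (a - 1.0) / a_new * (S_new - S_prev)
        R_prev, S_prev, a = R_new, S_new, a_new
    return R_prev + S_prev, R_prev, S_prev
```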

E. Kernelized Framework for Robust Feature-Level Fusion

In order to exploit the non-linearity of features, which has been shown to enhance performance in many computer vision tasks [6], [73], [74], we extend our algorithm into a general framework that can combine multiple visual features from different kernel spaces while detecting unreliable features for tracking. Let $\phi$ denote the mapping function to a kernel space. According to (3), the sparse representation of the tracking result over the template set in the linear space can be extended to a kernel space as follows,

$$\phi(y^k) = \Phi(X^k) w^k + \epsilon^k, \quad k = 1, \ldots, K \tag{23}$$

where $\Phi(X^k) = [\phi(x_1^k), \cdots, \phi(x_N^k)]$ denotes the template set in the kernel space via the mapping function $\phi$. Based on the derivation leading to (13), the robust joint kernel sparse representation based feature-level fusion (K-RJSRFFT) model for visual tracking is formulated as,

$$\min_{W,R,S} \ \frac{1}{2}\sum_{k=1}^{K} \|\phi(y^k) - \Phi(X^k) w^k\|_2^2 + \lambda_1 \sum_{n=1}^{N} \|r_n\|_2 + \lambda_2 \sum_{k=1}^{K} \|s^k\|_2 \quad \text{s.t.} \ \ W = R + S \tag{24}$$

Expanding the reconstruction term in optimization problem (24), we can reformulate (24) as,

$$\min_{W,R,S} \ \frac{1}{2}\sum_{k=1}^{K} \big\{\mathcal{K}(y^k, y^k) - 2 (w^k)^T \mathcal{K}(X^k, y^k) + (w^k)^T \mathcal{K}(X^k, X^k) w^k\big\} + \lambda_1 \sum_{n=1}^{N} \|r_n\|_2 + \lambda_2 \sum_{k=1}^{K} \|s^k\|_2 \quad \text{s.t.} \ \ W = R + S \tag{25}$$

Here, the element in the $i$-th row and $j$-th column of the kernel matrix $\mathcal{K}(X, Y)$ is defined as,

$$\mathcal{K}_{i,j}(X, Y) = \phi(x_i)^T \phi(y_j) \tag{26}$$

where $x_i$ and $y_j$ are the $i$-th and $j$-th columns of $X$ and $Y$, respectively. Moreover, $\mathcal{K}(y^k, y^k)$ in problem (25) is equal to 1 because the features are normalized to unit length. Based on Algorithm 1, the optimization procedure for problem (25) can be derived and is summarized in Algorithm 2.

Algorithm 2: Optimization Procedure for Problem (25)

After solving (25), the representation coefficients of the different visual cues are estimated and the unreliable features in the kernel space are detected. Then, the observation likelihood function can be re-defined as,

$$p(z_t \mid l_t) \propto \exp\Big(-\frac{1}{K - K'}\sum_{j \notin O} \big[\mathcal{K}(y^j, y^j) - 2 (w^j)^T \mathcal{K}(X^j, y^j) + (w^j)^T \mathcal{K}(X^j, X^j) w^j\big]\Big) \tag{27}$$

where $O$ denotes the index set of unreliable features in the kernel space as in (14), and $K$ and $K'$ denote the total number of features and the number of unreliable features in the kernel space, respectively.
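Because (25) and (27) depend on the data only through kernel matrices, each cue can be handled without knowing its mapping φ explicitly. The sketch below illustrates this with a linear kernel as one example (the log-Euclidean and polynomial kernels used in Section V enter in the same way); the kernel triples and the weight matrix W are assumed to come from a solver for (25), and the function names are illustrative.

```python
import numpy as np

def linear_kernel(X, Y):
    """K_{i,j}(X, Y) = x_i^T y_j for columns x_i of X and y_j of Y, cf. Eq. (26)."""
    return X.T @ Y

def kernel_recon_error(Kxx, Kxy, Kyy, w):
    """Bracketed term of Eqs. (25) and (27) for one cue:
    K(y, y) - 2 w^T K(X, y) + w^T K(X, X) w."""
    return float(Kyy - 2.0 * (w @ Kxy) + w @ Kxx @ w)

def kernel_likelihood(kernels, W, O):
    """Eq. (27): likelihood from the average kernel-space reconstruction error
    of the reliable cues. kernels[k] is a triple (Kxx, Kxy, Kyy) for cue k."""
    reliable = [k for k in range(len(kernels)) if k not in O]
    errors = [kernel_recon_error(*kernels[k], W[:, k]) for k in reliable]
    return float(np.exp(-np.mean(errors)))
```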

F. Computational Complexity

The major computation time of the proposed tracker is due to the following processes: feature extraction, kernel matrix computation, Lipschitz constant estimation and the optimization procedure. The linear/non-linear kernel matrix for the feature dictionary is recomputed only when the template update scheme is performed. Therefore, assuming that the flop count of the kernel mapping of two feature vectors is $p$, the computational complexity of the kernel matrix computation for the corresponding feature dictionary is $O(N^2 p)$, where $N$ is the number of templates in the template set. By using the tight Lipschitz constant estimation discussed in Section IV, the computational complexity of the Lipschitz constant estimation is reduced from $O(K^3 N^3)$ to $O(N^3)$, where $K$ is the number of visual cues. The optimization procedure illustrated in Algorithms 1 and 2 is dominated by the gradient computation in Steps 3 and 4, whose complexity is $O(K N^2 n)$, where $n$ is the number of particles. The proposed tracker is implemented in MATLAB without code optimization.

As measured on an i7 quad-core machine with a 3.4 GHz CPU, the running times of K-RJSRFFT and RJSRFFT are 2 sec/frame and 1.5 sec/frame, respectively. We will explore ways to increase the computational efficiency in our future work so that the tracker can run in real time.

IV. IMPLEMENTATION DETAILS

A. Template Update Scheme

The proposed tracker is sparse-based. We adopt the template update scheme in [17] with a minor modification to fit our proposed fusion-based tracker with an outlier detection scheme. Similar to [17], we associate each template in the different visual cues with a weight, and the weight is updated in each frame. Once the similarity between the template with the largest weight from a reliable visual cue and the target sample of the corresponding visual cue is less than a predefined threshold, the proposed tracker replaces the template which has the least weight with the target sample. The difference between [17] and the proposed method is that the update scheme in this paper is performed simultaneously for the template sets of the different visual cues. Once one of the templates of a visual cue is replaced, the corresponding template in the other visual cues is also replaced, because the proposed model performs multi-cue fusion at the feature level. As such, all the cues of the same template are updated simultaneously. The template update scheme is summarized in Algorithm 3.

Algorithm 3: Template Update Scheme

It should be noted that the similarity threshold determines the template update frequency: a higher similarity threshold results in a higher update frequency, while a lower similarity threshold leads to a lower update frequency. If the template set is updated too frequently, small errors are accumulated and the tracker gradually drifts from the target. On the other hand, if the template set remains unchanged for a long time, it may not adapt well to appearance and background changes. In our experiments, we set the similarity threshold to cos(35°) ≈ 0.82.
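A minimal sketch of this update rule is given below. It is illustrative only: the choice of the reliable cue used for the similarity test and the re-initialization of the replaced template's weight are assumptions, and the exact weighting scheme of [17] is not reproduced.

```python
import numpy as np

def update_templates(templates, weights, target, k_rel=0,
                     sim_threshold=np.cos(np.deg2rad(35.0))):
    """Minimal sketch of the template update rule (cf. Algorithm 3).

    templates : list over cues; templates[k] is a (d_k, N) matrix of unit-norm columns
    weights   : (N,) template weights shared across the cues
    target    : list over cues; target[k] is the unit-norm feature of the tracking result
    k_rel     : index of a reliable cue used for the similarity test (assumption)
    """
    best = int(np.argmax(weights))
    # Cosine similarity between the most heavily weighted template and the target.
    sim = float(templates[k_rel][:, best] @ target[k_rel])
    if sim < sim_threshold:
        worst = int(np.argmin(weights))
        # Replace the least-weighted template in every cue simultaneously so that
        # all cues of the same template stay aligned.
        for k in range(len(templates)):
            templates[k][:, worst] = target[k]
        weights[worst] = np.median(weights)  # assumed re-initialization of its weight
    return templates, weights
```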

B. Tight Lipschitz Constant Estimation

The Lipschitz constant μ is important to the above optimization algorithm. An improper value of μ will result in either divergence or slow convergence. In this subsection, we derive a method for tight Lipschitz constant estimation. With this method, the Lipschitz constant is set automatically in the initial frame and updated whenever the template update scheme is performed. Proposition 1 gives the method for tight Lipschitz constant estimation.

Proposition 1: Let $F$ denote the function defined in (16) with $\{X^k \mid k = 1, \ldots, K\}$, where $X^k$ denotes the template set of the $k$-th visual cue. The Lipschitz constant for $\nabla F$ can be estimated via its tight lower bound as follows,

$$\mu \ge \max\{2\lambda_{\max}^k \mid k = 1, \ldots, K\} \tag{28}$$

where $\lambda_{\max}^k$ is the largest eigenvalue of $\mathcal{K}(X^k, X^k)$.

The proof of Proposition 1 can be found in the Appendix. We can see that, with the method in (28), the tight lower bound of the Lipschitz constant can be estimated from the maximum eigenvalues of the sub-blocks of the Hessian matrix alone. Therefore, it avoids the direct computation of the eigenvalues of the high-dimensional Hessian matrix. In the experiments, we set the value of the Lipschitz constant to this lower bound.
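The estimate in (28) only requires the largest eigenvalue of each per-cue kernel matrix, as the following sketch illustrates (the function name is illustrative).

```python
import numpy as np

def lipschitz_lower_bound(kernel_mats):
    """Tight Lipschitz estimate of Eq. (28): mu = max_k 2 * lambda_max(K(X^k, X^k)),
    using only the per-cue kernel (Gram) matrices instead of the full Hessian."""
    lam_max = [np.linalg.eigvalsh(Kk)[-1] for Kk in kernel_mats]  # largest eigenvalues
    return 2.0 * max(lam_max)
```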

V. EXPERIMENTS

In this section, we evaluate the proposed robust joint sparse representation-based feature-level fusion tracker (RJSRFFT) and its kernelized version (K-RJSRFFT) using both synthetic data and real videos from publicly available datasets.

A. Unreliable Feature Detection on Synthetic Data

To demonstrate that the proposed method can detect unreliable features, we compare the weight matrices recovered by RJSRFFT with those obtained by solving (4) with the regularization term (6), as in the multi-task joint sparse representation (MTJSR) method [15]. In this experiment, we simulate the multi-cue tracking problem by randomly generating five kinds of ten-dimensional normalized features with 30 templates, i.e., $X^k \in \mathbb{R}^{10 \times 30}$, $k = 1, \cdots, 5$, are the template sets. Two kinds of features are set as unreliable with non-sparse patterns. For the other three kinds of reliable features, we divide the template sets into three groups and randomly generate the template weight vector $w^k \in \mathbb{R}^{30}$ such that only the elements of $w^k$ corresponding to one group of templates are non-zero. The testing sample of the $k$-th feature $y^k$ representing the current tracking result is computed as $X^k w^k$ plus a Gaussian noise vector with zero mean and variance 0.2 representing the reconstruction error $\epsilon^k$. For a fair comparison with MTJSR [15], we extend our model to impose the group lasso penalty by simply using a group sparsity term in optimization problem (13). We empirically set the parameters $\lambda$, $\lambda_1$, $\lambda_2$ to 0.001 and the step size $\mu$ to 0.002, and repeat the experiment 100 times.
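For concreteness, the following sketch generates data following this protocol (five cues, two of them non-sparse, three reliable cues sharing one active template group, and Gaussian noise of variance 0.2). It is one plausible reading of the protocol rather than our exact script; in particular, how the shared group is chosen is an assumption.

```python
import numpy as np

def make_synthetic_cues(n_cues=5, n_unreliable=2, dim=10, n_templates=30,
                        n_groups=3, noise_var=0.2, rng=None):
    """Synthetic multi-cue data following the protocol described above."""
    rng = np.random.default_rng() if rng is None else rng
    groups = np.array_split(np.arange(n_templates), n_groups)
    shared = groups[rng.integers(n_groups)]   # group shared by the reliable cues
    X, W_cols, Y = [], [], []
    for k in range(n_cues):
        Xk = rng.normal(size=(dim, n_templates))
        Xk /= np.linalg.norm(Xk, axis=0, keepdims=True)   # normalized templates
        wk = np.zeros(n_templates)
        if k < n_cues - n_unreliable:
            wk[shared] = rng.normal(size=len(shared))     # shared sparsity pattern
        else:
            wk = rng.normal(size=n_templates)             # non-sparse: unreliable cue
        yk = Xk @ wk + rng.normal(scale=np.sqrt(noise_var), size=dim)
        X.append(Xk); W_cols.append(wk); Y.append(yk)
    return X, np.stack(W_cols, axis=1), Y
```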

We use the average normalized mean square error between the original weight matrix and the recovered one for evaluation. Our RJSRFFT method achieves a much lower average recovery error of 4.69%, compared with 12.29% for MTJSR. This indicates that our method can better recover the underlying weight matrix by successfully detecting the unreliable features. To further demonstrate the ability to detect unreliable features, we give a graphical illustration of one of the 100 experiments in Fig. 1. The original weight matrix is shown in Fig. 1(a), with each row representing a weight vector $w^k$. The horizontal axis records the sample indices, while the vertical axis gives the values of the weights. From Fig. 1(a), we can see that the first three features share the same sparsity pattern over the samples with indices in the middle range, while all the weights of the last two features are non-zero, and thus non-sparse. In this case, MTJSR cannot discover the sparsity patterns, as shown in Fig. 1(b), while the proposed RJSRFFT can find the shared sparsity of the reliable features and detect the unreliable features, as shown in Fig. 1(c) and 1(d). This also explains why our method can better recover the underlying matrix shown in Fig. 1(a).

B. Visual Tracking Experiments

To evaluate the performance of the proposed tracking algorithms, i.e., RJSRFFT and K-RJSRFFT, we conduct experiments on thirty-three publicly available video sequences.2 The challenging factors in these video sequences include occlusion, cluttered background, and changes of scale, illumination and pose, which are listed in Table I. We compare the proposed tracking algorithms, RJSRFFT and K-RJSRFFT, with state-of-the-art tracking methods, including multi-cue trackers: online boosting (OAB) [7], online multiple instance learning (MIL) [8], semi-supervised online boosting (SemiB) [12]; sparse representation-based trackers: the ℓ1 tracker (L1APG) [55], the sparse collaborative model (SCM) [56], the multi-task tracker (MTT) [57]; and other state-of-the-art methods: the Struck method (Struck) [49], the circulant structure tracker (CSK) [50], the fragment tracker (Frag) [41], the incremental learning tracker (IVT) [31], the distribution field tracker (DFT) [77], and the compressive tracker (CT) [4]. The source codes of these trackers are provided by the authors of the respective papers. For a fair comparison, all the trackers are set up with the same initialization parameters. To illustrate the effectiveness of the proposed unreliable feature detection scheme, which encodes the non-zero weights of the unreliable features in matrix S and detects the unreliable features based on the vector norm, we also implement the Joint Sparse Representation-based Feature-level Fusion Tracker without the unreliable feature detection scheme (JSRFFT for short), as discussed in Section III(B). Both qualitative and quantitative evaluations are presented in Section V(B).

TABLE I: THE CHALLENGING FACTORS AND THEIR CORRESPONDING VIDEOS

1) Experiment Settings: For the proposed K-RJSRFFT, we extract six kinds of local and global features with a total of three kinds of kernels for fusion from a normalized 32 × 32 image patch representing the target observation. For the local visual cues, we divide the tracking bounding box into 2 × 2 blocks and extract a covariance descriptor in each block with the log-Euclidean kernel function as in [78]. For the global visual cues, we use HOG [79] with linear and polynomial kernels, and GLF [5] with a linear kernel. For RJSRFFT and JSRFFT, we extract the same kinds of features but only with linear kernels. We empirically set the parameters as follows. λ1 and λ2 are set to 0.57 and 0.87, respectively. The threshold T in (14) is 0.0007, the number of templates is 12, and 200 samples are drawn for particle filtering. We adopt the cosine function as the similarity function, and empirically set the similarity threshold τ to cos(35°) ≈ 0.82.

2 The video sequences can be downloaded from: http://www.cs.technion.ac.il/~amita/fragtrack/fragtrack.htm [41], http://www.cs.toronto.edu/~dross/ivt/ [31], http://personal.ee.surrey.ac.uk/Personal/Z.Kalal/ [75], [21], http://www.cs.toronto.edu/vis/projects/dudekfaceSequence.html [25], http://ice.dlut.edu.cn/lu/Project/TIP12-SP/TIP12-SP.htm [59], http://cv.snu.ac.kr/research/~vtd/ [33], http://lrs.icg.tugraz.at/research/houghtrack/ [76], http://www.umiacs.umd.edu/~fyang/spt.html [51].

2) Visualization Experiment: To illustrate how unreliable features are detected by the proposed method, for the tracking result in the bounding box shown in the upper-left image of Fig. 2, we use the features of the tracking result and the template set to obtain the reconstruction coefficients by solving the kernel sparse representation problem for each feature, as shown in the small green box, and we plot the coefficient values of each template for all features. A higher coefficient value means that the corresponding template is more representative for reconstructing the target. As we can see, the reconstruction coefficients of the GLF feature do not demonstrate a sparsity pattern as the other features do, which means the GLF feature extracted from the tracking result cannot be well represented by the corresponding template set. We can also note that, for most features, the sixth template is selected as the most representative template with the highest coefficient value, which implies an underlying relationship between the different features. Through our proposed model, the sixth template is jointly selected as the representative template for target representation, as shown in the bottom-left figure, and the GLF feature is detected as an unreliable feature, as shown in the bottom-right figure.

3) Quantitative Comparison: The performance of the compared trackers is evaluated quantitatively from two aspects: video-by-video comparison and video set-based comparison. Following [43], [56], [59]–[61], [80], three evaluation criteria are used for the quantitative comparison: center location error, overlap ratio and success rate. The center location error is defined as the Euclidean distance between the center of a bounding box and the labeled ground truth. The overlap ratio is defined as $\frac{area(B_T \cap B_G)}{area(B_T \cup B_G)}$, where $B_T$ and $B_G$ are the bounding boxes of the tracker and the ground truth, respectively. A frame is considered successfully tracked if the overlap ratio is larger than 0.5. Tables II, III and IV show the quantitative comparison of the evaluated trackers in terms of center location error, overlap ratio and success rate, respectively, and the last row of Table IV shows the average frame-per-second running time of the compared trackers. The proposed trackers RJSRFFT and K-RJSRFFT rank in the top three among all trackers.
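These three criteria can be computed as in the following sketch, which assumes bounding boxes are given in (x, y, w, h) form; the box convention is an assumption, not a detail fixed by the evaluation protocol above.

```python
import numpy as np

def center_location_error(b_t, b_g):
    """Euclidean distance between the centers of two boxes given as (x, y, w, h)."""
    c_t = np.array([b_t[0] + b_t[2] / 2.0, b_t[1] + b_t[3] / 2.0])
    c_g = np.array([b_g[0] + b_g[2] / 2.0, b_g[1] + b_g[3] / 2.0])
    return float(np.linalg.norm(c_t - c_g))

def overlap_ratio(b_t, b_g):
    """area(B_T intersect B_G) / area(B_T union B_G) for boxes (x, y, w, h)."""
    x1, y1 = max(b_t[0], b_g[0]), max(b_t[1], b_g[1])
    x2 = min(b_t[0] + b_t[2], b_g[0] + b_g[2])
    y2 = min(b_t[1] + b_t[3], b_g[1] + b_g[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = b_t[2] * b_t[3] + b_g[2] * b_g[3] - inter
    return inter / union if union > 0 else 0.0

def success_rate(boxes_t, boxes_g, thr=0.5):
    """Fraction of frames whose overlap ratio exceeds the 0.5 threshold."""
    ratios = [overlap_ratio(bt, bg) for bt, bg in zip(boxes_t, boxes_g)]
    return float(np.mean([r > thr for r in ratios]))
```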


Fig. 2. Graphical illustration of the unreliable feature detection scheme. The upper-left image in the purple box shows the tracking result in the 658-th frame of the video Occluded Face 1. The seven figures in the upper-right green box illustrate the resulting reconstruction coefficients obtained by solving the ℓ1 minimization problem shown in the small green box. After solving (24), the resulting reconstruction coefficients of each feature encoded in matrix R are illustrated in the bottom-left blue box, and the unreliable detection results encoded in matrix S are illustrated in the bottom-right blue box. The y-coordinate of all the above figures denotes the value of the coefficient, and the x-coordinate denotes the index of the template in the template set.

TABLE II: CENTER LOCATION ERROR (IN PIXELS). THE BEST THREE RESULTS ARE SHOWN IN RED, BLUE AND GREEN.

The proposed trackers achieve much better results than the sparse representation-based trackers that use a single feature, i.e., L1APG, SCM and MTT. This is because our trackers are based on multi-feature sparse representations, which can dynamically utilize different features to handle different kinds of variations while preserving the advantages of sparse representation. The proposed multi-cue trackers also perform better than existing multi-cue trackers, i.e., OAB, MIL, and SemiB.


TABLE III: VOC OVERLAPPING RATE. THE BEST THREE RESULTS ARE SHOWN IN RED, BLUE AND GREEN.

TABLE IV: SUCCESS RATE. THE BEST THREE RESULTS ARE SHOWN IN RED, BLUE AND GREEN.

This is because more informative cues are dynamically exploited for fusion at the feature level rather than at the score level. However, the feature-level fusion-based tracker JSRFFT does not perform as well as RJSRFFT and K-RJSRFFT, which suggests that the unreliable feature detection scheme is essential for feature-level fusion.


Fig. 3. The survival curves based on the F-score for the entire set of thirty-three videos and for the subsets of videos containing different challenging factors; the average F-scores are shown in the corresponding legends in descending order. (a) Whole set of videos. (b) Occlusion subset of videos. (c) Cluttered background subset of videos. (d) Scale subset of videos. (e) Illumination subset of videos. (f) Pose subset of videos.

A comparison of the performance of RJSRFFT and JSRFFT suggests that unreliable feature detection can improve the performance of the joint sparse representation-based tracker. K-RJSRFFT outperforms RJSRFFT, which indicates that exploiting the non-linearity of features can improve the performance of the multi-cue tracker.

Recently, Smeulders et al. presented a comprehensive survey and systematic evaluation of state-of-the-art visual tracking algorithms in [23] based on the Amsterdam Library of Ordinary Videos data set, named ALOV++. The evaluation protocol in [23] shows that the F-score is an effective metric for evaluating a tracker's accuracy, and the survival curve illustrated in [23] provides a cumulative rendition of the quality of a tracker on a set of videos, which avoids the risk of being trapped in the peculiarity of a single video instance. Therefore, to conduct the video set-based comparison, we compute the F-score from the tracking results of each tracker on each video, and plot the corresponding survival curve for the entire set of thirty-three videos and for the subsets of videos containing different challenging factors as shown in Table I. The F-score for each video is defined as 2 · (precision · recall)/(precision + recall), where precision = n_tp/(n_tp + n_fp), recall = n_tp/(n_tp + n_fn), and n_tp, n_fp, n_fn denote the numbers of true positives, false positives and false negatives in the video. After obtaining the F-score of each video in the video set, the videos are sorted in descending order of F-score and the survival curve is plotted with respect to the sorted videos. Fig. 3 illustrates the survival curves based on the F-score for the whole set of thirty-three videos and for the subsets of videos containing different challenging factors. We can see that the K-RJSRFFT method achieves the best performance on the whole video set and on every subset in terms of average F-score. By performing feature-level fusion in multiple kernel spaces, the K-RJSRFFT method performs better than the JSRFFT method. With the unreliable feature detection scheme, both the K-RJSRFFT and RJSRFFT methods perform better than the JSRFFT method, which shows the effectiveness of the unreliable feature detection scheme. By utilizing the complementarity of multiple reliable features, both the K-RJSRFFT and RJSRFFT methods show superior performance under different challenging factors, especially under occlusion, illumination change and cluttered background.
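As a sketch of this video set-based protocol (not the authors' evaluation code), the per-video F-scores can be sorted in descending order and plotted as a survival curve whose legend reports the average F-score; the tracker names and numbers below are purely hypothetical placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt

def f_score(n_tp, n_fp, n_fn):
    """F-score = 2 * precision * recall / (precision + recall)."""
    precision = n_tp / float(n_tp + n_fp)
    recall = n_tp / float(n_tp + n_fn)
    return 2.0 * precision * recall / (precision + recall)

def plot_survival_curve(per_video_f_scores, label):
    """Sort per-video F-scores in descending order and plot them against the video rank."""
    scores = np.sort(np.asarray(per_video_f_scores))[::-1]
    plt.plot(np.arange(1, len(scores) + 1), scores,
             label="%s (avg F-score = %.2f)" % (label, scores.mean()))

# Hypothetical per-video F-scores for two trackers on the same five-video set.
plot_survival_curve([0.91, 0.85, 0.78, 0.66, 0.40], "tracker A")
plot_survival_curve([0.80, 0.74, 0.69, 0.51, 0.33], "tracker B")
plt.xlabel("video rank (sorted by F-score)")
plt.ylabel("F-score")
plt.legend()
plt.show()
```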

4) Qualitative Comparison: We qualitatively evaluate the trackers in five aspects based on some typical video sequences shown in Fig. 4, as follows:

a) Occlusion: In the video Occluded Face 1, a woman's face undergoes partial and severe occlusion by a book in some frames. Except that the CT, DFT, JSRFFT and MIL methods drift slightly when the woman's face is severely occluded around Frames 730 and 873, all other trackers can successfully track the face throughout the whole video. In the video David Outdoor, a man is walking with body deformation and occlusion. The contrast between the target and the background is low, which makes the SemiB, MIL, CT, L1APG, Struck and JSRFFT methods lose the target around Frames 49, 108, 120, 195 and 238. The full-body occlusion also causes tracking drift for the MTT, OAB, SCM, CSK, JSRFFT, DFT and IVT methods around Frames 27, 83 and 188. Because the K-RJSRFFT method combines local and global features in a non-linear kernel space to account for the variation in foreground and background, it performs well over the whole sequence. In the sequence Walking2, a woman is walking with a change of scale. Most trackers perform well, except that the Frag, OAB, SemiB, CT, MIL and JSRFFT methods drift away from the woman due to the occlusion caused by a man with a similar appearance. As can be seen from the above comparison, the proposed trackers RJSRFFT and K-RJSRFFT can handle occlusion well.


Fig. 4. Qualitative results on some typical frames including some challenging factors. (a) Occlusion. (b) Background. (c) Scale. (d) Illumination. (e) Pose.

With a merit similar to that of [3], the fusion of local features makes them less sensitive to partial occlusion. Besides, different from the JSRFFT method, which directly fuses all features, the unreliable feature detection scheme in the RJSRFFT and K-RJSRFFT methods can improve the tracking performance of joint sparse representation-based trackers, since the features extracted from an occluded object may not be reliable.

b) Cluttered background: In the video Crossing, the contrast between the walking person and the background is low, which leads to drift for some trackers, i.e., the MTT, L1APG, DFT and Frag methods around Frames 55 and 83. In the video Mountain-bike, a man riding a mountain bike undergoes pose variation against a cluttered background. The cluttered background makes the CT, Frag, DFT and JSRFFT methods drift away from the tracked object around Frames 50 and 95. In the video Jumping, a man is jumping rope with motion blur on his face. The cluttered background makes it easy for trackers such as the IVT, SCM, L1APG, MTT, OAB, CT, DFT, CSK and SemiB methods to drift from the blurred face, as shown in Frame 117.

The proposed tracker K-RJSRFFT performs well under a cluttered background and shows improved results compared with the RJSRFFT method. This is because, different from the RJSRFFT method, the K-RJSRFFT method utilizes the non-linear similarity of features and performs feature fusion in a more discriminative kernel space, which enhances the discriminability of the appearance model and makes it less sensitive to a cluttered background.

c) Scale: In the video Car4, there is a drastic change of scale and illumination when the car goes underneath the overpass. Only the CSK, SCM, IVT, RJSRFFT, K-RJSRFFT, Struck and JSRFFT methods perform well over the whole sequence, while the other trackers drift, to a small or large extent, away from the car. In the David Indoor sequence, a man is walking out of a dark room with a gradual change in scale and pose. The IVT, SCM, MIL, RJSRFFT, K-RJSRFFT and JSRFFT methods perform well on this sequence, while the other trackers drift when the man undergoes pose and scale variation (e.g., Frame 166). In the Freeman3 sequence, a man is walking in a classroom from the back to the front with a gradual change in scale. Due to the similar appearance of human faces in the background and the scale change of the target, the OAB, Struck, Frag, CT, MIL, CSK, IVT, L1APG and DFT methods lose track of the target. The K-RJSRFFT, RJSRFFT and JSRFFT methods perform well on this sequence.

d) Illumination: In the challenging video Trellis, the object appearance changes significantly because of illumination and pose. The OAB, SemiB, IVT, MTT, L1APG, CT and MIL methods lose the target due to the cast shadow on the face. When the object changes its pose, the SCM method drifts away from the target. Only the RJSRFFT, K-RJSRFFT and JSRFFT methods perform well over the whole sequence, as a result of the fusion of a local illumination-insensitive feature, which enables them to handle such variations.


The video Skating1 is also challenging: a woman is skating under variations of pose, illumination and background, and some partial occlusion also occurs, as shown in Frame 164. Except for the proposed trackers K-RJSRFFT and RJSRFFT, the other trackers do not perform well over the whole sequence. Moreover, the dramatic change of lighting in the background makes the RJSRFFT method drift away from the target, while the K-RJSRFFT method still tracks the target stably, which shows that fusing features in kernel space can enhance the discriminability of feature fusion-based trackers. In the sequence Car11, the significant change of illumination and the low resolution of the object make tracking difficult. Except for the DFT, CT, Frag and MIL methods, all other trackers are able to track the target with a high success rate.

e) Pose: The Shaking sequence is challenging because of significant pose variations. The proposed method K-RJSRFFT can successfully track the target, while the RJSRFFT, OAB, MIL and CSK methods drift slightly from the target due to the pose variation. Other trackers such as the JSRFFT, CT, IVT, SemiB, MTT and L1APG methods fail to track the target due to the combined variation of pose, occlusion and illumination around Frame 59. In the Basketball sequence, the target is a basketball player running on the court who undergoes significant pose variation as well as some partial occlusion. The Struck, OAB and JSRFFT methods drift from the target around Frame 64, and when another player with a similar appearance comes close to the target around Frame 302, the MIL, CT and L1APG methods lose the target. Only the K-RJSRFFT, DFT, CSK and Frag methods perform well in this sequence. In the sequence Sylvester, the object undergoes significant pose variation around Frame 1091, and the DFT, IVT, SCM, L1APG, Frag and MIL methods drift away from the target. Our proposed trackers K-RJSRFFT and RJSRFFT are able to track the target with a high success rate.

VI. CONCLUSION AND FUTURE WORK

In this paper, we have formulated a feature-level fusion visual tracker based on joint sparse representation. As shown in the experimental results, the proposed method outperforms other feature fusion-based trackers and sparse representation-based trackers under appearance variations such as occlusion, scale, illumination and pose, which demonstrates that the proposed robust feature-level fusion method is effective. On the other hand, since the proposed method is a sparsity-based tracker, real-time tracking is difficult to achieve. Therefore, one direction of our future work is to develop more efficient methods that reduce the computational complexity for practical applications. Besides, we will explore how to extend the optimization model to other computer vision problems. In addition, we will investigate how to fuse more information, such as prior information and context information, into our multi-cue tracking framework.

APPENDIX

PROOF OF PROPOSITION 1

Proof: The Hessian matrix H ∈ R^((2·N·K)×(2·N·K)) of the function F in (16) is a block-diagonal matrix with B_k as its k-th block, where

B_k = [ ∂²F/∂r_k²         ∂²F/(∂r_k ∂s_k)
        ∂²F/(∂s_k ∂r_k)   ∂²F/∂s_k²       ]
    = [ K(X_k, X_k)   K(X_k, X_k)
        K(X_k, X_k)   K(X_k, X_k) ] ∈ R^((2·N)×(2·N)).    (29)

Assume that Λ_k = diag(λ_{k,1}, . . . , λ_{k,2N}) is the diagonal matrix whose elements are the eigenvalues of B_k, and that Z_k is the matrix whose columns are the corresponding eigenvectors, i.e., B_k Z_k = Z_k Λ_k. Let Z = diag(Z_1, . . . , Z_K) and Λ = diag(Λ_1, . . . , Λ_K). Then HZ = diag(B_1 Z_1, . . . , B_K Z_K) = diag(Z_1 Λ_1, . . . , Z_K Λ_K) = diag(Z_1, . . . , Z_K) · diag(Λ_1, . . . , Λ_K) = ZΛ. That is, the eigenvalues of H are exactly the eigenvalues of its diagonal blocks B_1, . . . , B_K. Hence,

λ_max(H) = max{λ_max(B_k) | k = 1, . . . , K} = max{2 · λ_max(K(X_k, X_k)) | k = 1, . . . , K},    (30)

where λ_max(H) and λ_max(B_k) are the maximum eigenvalues of H and B_k, respectively, and the second equality holds because each B_k is the Kronecker product of the 2 × 2 all-ones matrix with K(X_k, X_k), so its non-zero eigenvalues are twice those of K(X_k, X_k). The proof is done.
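The identity in (30) can also be checked numerically. The sketch below is not part of the paper: it substitutes random positive semi-definite Gram matrices for the actual kernels K(X_k, X_k), assembles the block-diagonal Hessian, and confirms that its largest eigenvalue equals max_k 2 · λ_max(K(X_k, X_k)).

```python
import numpy as np

rng = np.random.default_rng(0)
N, K_feat = 5, 3  # number of templates N and number of features K (small values, assumed for the check)

blocks = []
for _ in range(K_feat):
    A = rng.standard_normal((N, N))
    K_k = A @ A.T                                      # a stand-in PSD Gram matrix for K(X_k, X_k)
    blocks.append(np.block([[K_k, K_k], [K_k, K_k]]))  # k-th diagonal block B_k of the Hessian, as in (29)

# Block-diagonal Hessian H = diag(B_1, ..., B_K)
H = np.zeros((2 * N * K_feat, 2 * N * K_feat))
for k, B_k in enumerate(blocks):
    H[k * 2 * N:(k + 1) * 2 * N, k * 2 * N:(k + 1) * 2 * N] = B_k

lam_H = np.linalg.eigvalsh(H).max()
lam_blocks = max(2.0 * np.linalg.eigvalsh(B_k[:N, :N]).max() for B_k in blocks)
print(lam_H, lam_blocks)  # the two values agree up to numerical precision, as stated in (30)
```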

ACKNOWLEDGMENT

The authors would like to thank the editor and the reviewers for their helpful comments, which improved the quality of this paper.

REFERENCES

[1] X. Li, W. Hu, C. Shen, Z. Zhang, A. Dick, and A. van den Hengel, “A survey of appearance models in visual object tracking,” ACM Trans. Intell. Syst. Technol., vol. 4, no. 4, p. 58, Sep. 2013.

[2] S. He, Q. Yang, R. W. H. Lau, J. Wang, and M.-H. Yang, “Visual tracking via locality sensitive histograms,” in Proc. CVPR, Jun. 2013, pp. 2427–2434.

[3] W. Hu, X. Li, W. Luo, X. Zhang, S. J. Maybank, and Z. Zhang, “Single and multiple object tracking using log-Euclidean Riemannian subspace and block-division appearance model,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 12, pp. 2420–2440, Dec. 2012.

[4] K. Zhang, L. Zhang, and M.-H. Yang, “Real-time compressive tracking,” in Proc. ECCV, 2012, pp. 864–877.

[5] W. W. Zou, P. C. Yuen, and R. Chellappa, “Low-resolution face tracker robust to illumination variations,” IEEE Trans. Image Process., vol. 22, no. 5, pp. 1726–1739, May 2013.

[6] F. Yang, H. Lu, and M.-H. Yang, “Robust visual tracking via multiple kernel boosting with affinity constraints,” IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 2, pp. 242–254, Feb. 2014.

[7] H. Grabner and H. Bischof, “On-line boosting and vision,” in Proc. CVPR, Jun. 2006, pp. 260–267.

[8] B. Babenko, M.-H. Yang, and S. Belongie, “Robust object tracking with online multiple instance learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1619–1632, Aug. 2011.

[9] Y. Wu, E. Blasch, G. Chen, L. Bai, and H. Ling, “Multiple source data fusion via sparse representation for robust visual tracking,” in Proc. Int. Conf. Inf. Fusion, Jul. 2011, pp. 1–8.

[10] A. J. Ma, P. C. Yuen, and J.-H. Lai, “Linear dependency modeling for classifier fusion and feature combination,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 5, pp. 1135–1148, May 2013.

[11] A. J. Ma and P. C. Yuen, “Reduced analytic dependency modeling: Robust fusion for visual recognition,” Int. J. Comput. Vis., vol. 109, no. 3, pp. 233–251, Sep. 2014.

[12] H. Grabner, C. Leistner, and H. Bischof, “Semi-supervised on-line boosting for robust tracking,” in Proc. ECCV, 2008, pp. 234–247.

[13] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York, NY, USA: Wiley, 2006.


[14] X. Jia, D. Wang, and H. Lu, “Fragment-based tracking using online multiple kernel learning,” in Proc. ICIP, Sep./Oct. 2012, pp. 393–396.

[15] X.-T. Yuan and S. Yan, “Visual classification with multi-task joint sparse representation,” in Proc. CVPR, Jun. 2010, pp. 3493–3500.

[16] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009.

[17] X. Mei and H. Ling, “Robust visual tracking and vehicle classification via sparse representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 11, pp. 2259–2272, Nov. 2011.

[18] P. Gehler and S. Nowozin, “On feature combination for multiclass object classification,” in Proc. ICCV, Sep./Oct. 2009, pp. 221–228.

[19] X. Lan, A. J. Ma, and P. C. Yuen, “Multi-cue visual tracking using robust feature-level fusion based on joint sparse representation,” in Proc. CVPR, Jun. 2014, pp. 1194–1201.

[20] A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,” ACM Comput. Surv., vol. 38, no. 4, p. 13, 2006.

[21] Y. Wu, J. Lim, and M.-H. Yang, “Online object tracking: A benchmark,” in Proc. CVPR, Jun. 2013, pp. 2411–2418.

[22] S. Zhang, H. Yao, X. Sun, and X. Lu, “Sparse coding based visual tracking: Review and experimental comparison,” Pattern Recognit., vol. 46, no. 7, pp. 1772–1788, Jul. 2013.

[23] A. W. M. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah, “Visual tracking: An experimental survey,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, pp. 1442–1468, Jul. 2014.

[24] H. Yang, L. Shao, F. Zheng, L. Wang, and Z. Song, “Recent advances and trends in visual tracking: A review,” Neurocomputing, vol. 74, no. 18, pp. 3823–3831, Nov. 2011.

[25] A. D. Jepson, D. J. Fleet, and T. F. El-Maraghi, “Robust online appearance models for visual tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 10, pp. 1296–1311, Oct. 2003.

[26] S. K. Zhou, R. Chellappa, and B. Moghaddam, “Visual tracking and recognition using appearance-adaptive models in particle filters,” IEEE Trans. Image Process., vol. 13, no. 11, pp. 1491–1506, Nov. 2004.

[27] D. Comaniciu, V. Ramesh, and P. Meer, “Kernel-based object tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, pp. 564–577, May 2003.

[28] C. Shen, M. J. Brooks, and A. van den Hengel, “Fast global kernel density mode seeking: Applications to localization and tracking,” IEEE Trans. Image Process., vol. 16, no. 5, pp. 1457–1469, May 2007.

[29] I. Leichter, “Mean shift trackers with cross-bin metrics,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, pp. 695–706, Apr. 2012.

[30] M. J. Black and A. D. Jepson, “EigenTracking: Robust matching and tracking of articulated objects using a view-based representation,” in Proc. ECCV, 1996, pp. 329–342.

[31] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, “Incremental learning for robust visual tracking,” Int. J. Comput. Vis., vol. 77, no. 1, pp. 125–141, May 2008.

[32] L. Wen, Z. Cai, Z. Lei, D. Yi, and S. Z. Li, “Robust online learned spatio-temporal context model for visual tracking,” IEEE Trans. Image Process., vol. 23, no. 2, pp. 785–796, Feb. 2014.

[33] J. Kwon and K. M. Lee, “Visual tracking decomposition,” in Proc. CVPR, Jun. 2010, pp. 1269–1276.

[34] X. Mei, H. Ling, Y. Wu, E. P. Blasch, and L. Bai, “Efficient minimum error bounded particle resampling L1 tracker with occlusion detection,” IEEE Trans. Image Process., vol. 22, no. 7, pp. 2661–2675, Jul. 2013.

[35] X. Li, C. Shen, Q. Shi, A. Dick, and A. van den Hengel, “Non-sparse linear representations for visual tracking with online reservoir metric learning,” in Proc. CVPR, Jun. 2012, pp. 1760–1767.

[36] S. Zhang, H. Zhou, F. Jiang, and X. Li, “Robust visual tracking using structurally random projection and weighted least squares,” to be published, doi: 10.1109/TCSVT.2015.2406194.

[37] S. Zhang, H. Yao, H. Zhou, X. Sun, and S. Liu, “Robust visual tracking based on online learning sparse representation,” Neurocomputing, vol. 100, pp. 31–40, Jan. 2013.

[38] S. Zhang, H. Yao, X. Sun, and S. Liu, “Robust visual tracking using an effective appearance model based on sparse coding,” ACM Trans. Intell. Syst. Technol., vol. 3, no. 3, p. 43, May 2012.

[39] Z. Hong, X. Mei, D. Prokhorov, and D. Tao, “Tracking via robust multi-task multi-view joint sparse representation,” in Proc. ICCV, Dec. 2013, pp. 649–656.

[40] X. Mei, Z. Hong, D. Prokhorov, and D. Tao, “Robust multitask multiview tracking in videos,” IEEE Trans. Neural Netw. Learn. Syst., to be published, doi: 10.1109/TNNLS.2015.2399233.

[41] A. Adam, E. Rivlin, and I. Shimshoni, “Robust fragments-based tracking using the integral histogram,” in Proc. CVPR, Jun. 2006, pp. 798–805.

[42] S. Avidan, “Support vector tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 8, pp. 1064–1072, Aug. 2004.

[43] X. Li, A. R. Dick, C. Shen, Z. Zhang, A. van den Hengel, and H. Wang, “Visual tracking with spatio-temporal Dempster–Shafer information fusion,” IEEE Trans. Image Process., vol. 22, no. 8, pp. 3028–3040, Aug. 2013.

[44] K. Zhang and H. Song, “Real-time visual tracking via online weighted multiple instance learning,” Pattern Recognit., vol. 46, no. 1, pp. 397–411, Jan. 2013.

[45] N. Jiang, W. Liu, and Y. Wu, “Learning adaptive metric for robust visual tracking,” IEEE Trans. Image Process., vol. 20, no. 8, pp. 2288–2300, Aug. 2011.

[46] N. Jiang, W. Liu, and Y. Wu, “Order determination and sparsity-regularized metric learning adaptive visual tracking,” in Proc. CVPR, Jun. 2012, pp. 1956–1963.

[47] N. Jiang and W. Liu, “Data-driven spatially-adaptive metric adjustment for visual tracking,” IEEE Trans. Image Process., vol. 23, no. 4, pp. 1556–1568, Apr. 2014.

[48] W. Hu, X. Li, X. Zhang, X. Shi, S. J. Maybank, and Z. Zhang, “Incremental tensor subspace learning and its applications to foreground segmentation and tracking,” Int. J. Comput. Vis., vol. 91, no. 3, pp. 303–327, Feb. 2011.

[49] S. Hare, A. Saffari, and P. H. S. Torr, “Struck: Structured output tracking with kernels,” in Proc. ICCV, Nov. 2011, pp. 263–270.

[50] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “Exploiting the circulant structure of tracking-by-detection with kernels,” in Proc. ECCV, 2012, pp. 702–715.

[51] F. Yang, H. Lu, and M.-H. Yang, “Robust superpixel tracking,” IEEE Trans. Image Process., vol. 23, no. 4, pp. 1639–1651, Apr. 2014.

[52] H. Li, C. Shen, and Q. Shi, “Real-time visual tracking using compressive sensing,” in Proc. CVPR, Jun. 2011, pp. 1305–1312.

[53] B. Liu, L. Yang, J. Huang, P. Meer, L. Gong, and C. Kulikowski, “Robust and fast collaborative tracking with two stage sparse optimization,” in Proc. ECCV, 2010, pp. 624–637.

[54] B. Liu, J. Huang, C. Kulikowski, and L. Yang, “Robust visual tracking using local sparse appearance model and K-selection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 12, pp. 2968–2981, Dec. 2013.

[55] C. Bao, Y. Wu, H. Ling, and H. Ji, “Real time robust L1 tracker using accelerated proximal gradient approach,” in Proc. CVPR, Jun. 2012, pp. 1830–1837.

[56] W. Zhong, H. Lu, and M.-H. Yang, “Robust object tracking via sparse collaborative appearance model,” IEEE Trans. Image Process., vol. 23, no. 5, pp. 2356–2368, May 2014.

[57] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, “Robust visual tracking via structured multi-task sparse learning,” Int. J. Comput. Vis., vol. 101, no. 2, pp. 367–383, Jan. 2013.

[58] X. Jia, H. Lu, and M.-H. Yang, “Visual tracking via adaptive structural local sparse appearance model,” in Proc. CVPR, Jun. 2012, pp. 1822–1829.

[59] D. Wang, H. Lu, and M.-H. Yang, “Online object tracking with sparse prototypes,” IEEE Trans. Image Process., vol. 22, no. 1, pp. 314–325, Jan. 2013.

[60] B. Zhuang, H. Lu, Z. Xiao, and D. Wang, “Visual tracking via discriminative sparse similarity map,” IEEE Trans. Image Process., vol. 23, no. 4, pp. 1872–1881, Apr. 2014.

[61] Q. Wang, F. Chen, J. Yang, W. Xu, and M.-H. Yang, “Transferring visual prior for online object tracking,” IEEE Trans. Image Process., vol. 21, no. 7, pp. 3296–3305, Jul. 2012.

[62] L. Wang, H. Yan, K. Lv, and C. Pan, “Visual tracking via kernel sparse representation with multikernel fusion,” IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 7, pp. 1132–1141, Jul. 2014.

[63] Y. Wu, B. Ma, M. Yang, J. Zhang, and Y. Jia, “Metric learning based structural appearance model for robust visual tracking,” IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 5, pp. 865–877, May 2014.

[64] H. Zhang, N. M. Nasrabadi, Y. Zhang, and T. S. Huang, “Multi-observation visual recognition via joint dynamic sparse representation,” in Proc. ICCV, Nov. 2011, pp. 595–602.

[65] S. Shekhar, V. M. Patel, N. M. Nasrabadi, and R. Chellappa, “Joint sparse representation for robust multimodal biometrics recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 1, pp. 113–126, Jan. 2014.


[66] S. Bahrampour, A. Ray, N. M. Nasrabadi, and K. W. Jenkins, “Quality-based multimodal classification using tree-structured sparsity,” in Proc. CVPR, Jun. 2014, pp. 4114–4121.

[67] P. Gong, J. Ye, and C. Zhang, “Robust multi-task feature learning,” in Proc. KDD, 2012, pp. 895–903.

[68] Y. Nesterov, “Gradient methods for minimizing composite functions,” Math. Program., vol. 140, no. 1, pp. 125–161, Aug. 2013.

[69] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. Imag. Sci., vol. 2, no. 1, pp. 183–202, 2009.

[70] P. Tseng, “On accelerated proximal gradient methods for convex-concave optimization,” SIAM J. Optim., 2008.

[71] X.-T. Yuan, X. Liu, and S. Yan, “Visual classification with multitask joint sparse representation,” IEEE Trans. Image Process., vol. 21, no. 10, pp. 4349–4360, Oct. 2012.

[72] X. Chen, W. Pan, J. T. Kwok, and J. G. Carbonell, “Accelerated gradient method for multi-task sparse learning problem,” in Proc. ICDM, Dec. 2009, pp. 746–751.

[73] A. Shrivastava, V. M. Patel, and R. Chellappa, “Multiple kernel learning for sparse representation-based classification,” IEEE Trans. Image Process., vol. 23, no. 7, pp. 3013–3024, Jul. 2014.

[74] S. Gao, I. W. Tsang, and L.-T. Chia, “Sparse representation with kernels,” IEEE Trans. Image Process., vol. 22, no. 2, pp. 423–434, Feb. 2013.

[75] Z. Kalal, K. Mikolajczyk, and J. Matas, “Tracking-learning-detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, pp. 1409–1422, Jul. 2012.

[76] M. Godec, P. M. Roth, and H. Bischof, “Hough-based tracking of non-rigid objects,” in Proc. ICCV, Nov. 2011, pp. 81–88.

[77] L. Sevilla-Lara and E. G. Learned-Miller, “Distribution fields for tracking,” in Proc. CVPR, Jun. 2012, pp. 1910–1917.

[78] P. Li, Q. Wang, W. Zuo, and L. Zhang, “Log-Euclidean kernels for sparse representation and dictionary learning,” in Proc. ICCV, Dec. 2013, pp. 1601–1608.

[79] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. CVPR, Jun. 2005, pp. 886–893.

[80] Q. Wang, F. Chen, W. Xu, and M.-H. Yang, “Object tracking via partial least squares analysis,” IEEE Trans. Image Process., vol. 21, no. 10, pp. 4454–4465, Oct. 2012.

Xiangyuan Lan (S’14) received the B.Eng. degree in computer science and technology from the South China University of Technology, China, in 2012. He is currently pursuing the Ph.D. degree with the Department of Computer Science, Hong Kong Baptist University, Hong Kong.

Mr. Lan was a recipient of the National Scholarship from the Ministry of Education of China, in 2010 and 2011, and the IBM China Excellent Student Scholarship from IBM and the China Scholarship Council in 2012. In 2015, he was a Visiting Scholar with the Computer Vision Laboratory, University of Maryland Institute for Advanced Computer Studies, University of Maryland, College Park, USA.

Mr. Lan’s current research interests include computer vision and pattern recognition, particularly feature fusion and sparse representation for visual tracking.

Andy J. Ma received the B.Sc. and M.Sc. degrees in applied mathematics from Sun Yat-sen University, in 2007 and 2009, respectively, and the Ph.D. degree from the Department of Computer Science, Hong Kong Baptist University, in 2013. He is currently a Post-Doctoral Fellow with the Department of Computer Science, Johns Hopkins University. His current research interests focus on developing machine learning algorithms for intelligent video surveillance.

Pong C. Yuen (M’93–SM’11) received the B.Sc. (Hons.) degree in electronic engineering from City Polytechnic of Hong Kong, in 1989, and the Ph.D. degree in electrical and electronic engineering from The University of Hong Kong, in 1993. He joined the Hong Kong Baptist University, in 1993, where he is currently a Professor and the Head of the Department of Computer Science.

Dr. Yuen spent a six-month sabbatical leave with The University of Maryland Institute for Advanced Computer Studies, University of Maryland at College Park, in 1998. From 2005 to 2006, he was a Visiting Professor with the GRAphics, VIsion and Robotics Laboratory, INRIA, Rhone Alpes, France. He was the Director of the Croucher Advanced Study Institute on Biometric Authentication in 2004, and the Director of the Croucher ASI on Biometric Security and Privacy in 2007. He was a recipient of the University Fellowship to visit The University of Sydney in 1996.

Dr. Yuen has been actively involved in many international conferences as an Organizing Committee and/or a Technical Program Committee Member. He was the Track Co-Chair of the International Conference on Pattern Recognition in 2006 and the Program Co-Chair of the IEEE Fifth International Conference on Biometrics: Theory, Applications and Systems in 2012. He is an Editorial Board Member of Pattern Recognition and an Associate Editor of the IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY and the SPIE Journal of Electronic Imaging. He is also serving as a Hong Kong Research Grants Council Engineering Panel Member.

Dr. Yuen’s current research interests include video surveillance, human face recognition, biometric security, and privacy.

Rama Chellappa (S’78–M’79–SM’83–F’92) received the B.E. (Hons.) degree in electronics and communication engineering from the University of Madras, India, the M.E. (Hons.) degree from the Indian Institute of Science, Bangalore, India, and the M.S.E.E. and Ph.D. degrees in electrical engineering from Purdue University, West Lafayette, IN. From 1981 to 1991, he was a Faculty Member with the Department of Electrical Engineering-Systems, University of Southern California (USC). Since 1991, he has been a Professor of Electrical and Computer Engineering and an Affiliate Professor of Computer Science with the University of Maryland (UMD), College Park. He is also affiliated with the Center for Automation Research and the Institute for Advanced Computer Studies (Permanent Member) and is serving as the Chair of the Electrical and Computer Engineering Department. In 2005, he was named Minta Martin Professor of Engineering. His current research interests span many areas in image processing, computer vision, and pattern recognition. He is a Golden Core Member of the IEEE Computer Society, and served as a Distinguished Lecturer of the IEEE Signal Processing Society and as the President of the IEEE Biometrics Council. He is a fellow of IAPR, the Optical Society, the American Association for the Advancement of Science, the Association for Computing Machinery, and the Association for the Advancement of Artificial Intelligence, and holds four patents. He is a recipient of an NSF Presidential Young Investigator Award and four IBM Faculty Development Awards. He received two paper awards and the K. S. Fu Prize from the International Association of Pattern Recognition (IAPR). He is a recipient of the Society, Technical Achievement, and Meritorious Service Awards from the IEEE Signal Processing Society. He also received the Technical Achievement and Meritorious Service Awards from the IEEE Computer Society, as well as an Excellence in Teaching Award from the School of Engineering at USC. At UMD, he received college- and university-level recognitions for research, teaching, innovation, and mentoring of undergraduate students. In 2010, he was recognized as an Outstanding Electrical and Computer Engineer by Purdue University. He served as the Editor-in-Chief of the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE and as the General and Technical Program Chair/Co-Chair of several IEEE international and national conferences and workshops.