camera on-boarding for person re-identification using … · 2020-06-29 · camera on-boarding for...

Camera On-boarding for Person Re-identification using Hypothesis Transfer Learning

Sk Miraj Ahmed1,∗, Aske R Lejbølle2,∗,†, Rameswar Panda2, Amit K. Roy-Chowdhury1

1 University of California, Riverside, 2 Aalborg University, Denmark, 3 IBM Research AI, Cambridge

{sahme047@, alejboel@, rpand002@, amitrc@ece}.ucr.edu

Abstract

Most of the existing approaches for person re-

identification consider a static setting where the number of

cameras in the network is fixed. An interesting direction,

which has received little attention, is to explore the dynamic

nature of a camera network, where one tries to adapt the ex-

isting re-identification models after on-boarding new cam-

eras, with little additional effort. There have been a few

recent methods proposed in person re-identification that at-

tempt to address this problem by assuming the labeled data

in the existing network is still available while adding new

cameras. This is a strong assumption since there may exist

some privacy issues for which one may not have access to

those data. Rather, based on the fact that it is easy to store

the learned re-identification models, which mitigates any

data privacy concern, we develop an efficient model adapta-

tion approach using hypothesis transfer learning that aims

to transfer the knowledge using only source models and

limited labeled data, but without using any source cam-

era data from the existing network. Our approach mini-

mizes the effect of negative transfer by finding an optimal

weighted combination of multiple source models for trans-

ferring the knowledge. Extensive experiments on four chal-

lenging benchmark datasets with a variable number of cam-

eras well demonstrate the efficacy of our proposed approach

over state-of-the-art methods.

1. Introduction

Person re-identification (re-id), which addresses the

problem of matching people across different cameras, has

attracted intense attention in recent years [7, 29, 51]. Much

progress has been made in developing a variety of methods

to learn features [16, 21, 22] or distance metrics by exploit-

ing unlabeled and/or manually labeled data. Recently, deep

learning methods have also shown significant performance

∗ Equal contribution. † This work was done while AL was a visiting student at UC Riverside.

[Figure 1 diagram: an existing network of cameras C1, C2 and C3 with pairwise metrics M12, M13 and M23, learned from source data that is then discarded; a new camera C4 is on-boarded with limited pairwise labeled data shared with the existing cameras, and the metrics M41, M42 and M43 are to be learned.]

Figure 1: Consider a three camera (C1, C2 and C3) net-

work, where we have only three pairwise distance metrics

(M12, M23 and M13) available for matching persons, and

no access to the labeled data due to privacy concerns. A new

camera, C4, needs to be added into the system quickly, thus,

allowing us to have only very limited labeled data across

the new camera and the existing ones. Our goal in this pa-

per is to learn the pairwise distance metrics (M41, M42 and

M43) between the newly inserted camera(s) and the existing

cameras, using the learned source metrics from the existing

network and a small amount of labeled data available after

installing the new camera(s).

improvement on person re-id [1, 15, 31, 32, 44, 52]. How-

ever, with the notable exception of [25, 26], most of these

works have not yet considered the dynamic nature of a cam-

era network, where new cameras can be introduced at any

time to cover a certain related area that is not well-covered

by the existing network of cameras. To build a more scalable person re-identification system, it is essential to

consider the problem of how to on-board new cameras into

an existing network with little additional effort.

Let us consider a network of K cameras for which we have learned $\binom{K}{2}$ optimal pairwise matching metrics, one for each camera pair (see Figure 1

for an illustrative example). However, during an opera-

tional phase of the system, new camera(s) may be temporar-

ily introduced to collect additional information, which ide-

ally should be integrated with minimal effort. Given newly

introduced camera(s), the traditional re-id methods aim to

re-learn the pairwise matching metrics using a costly train-

ing phase. This is impractical in many situations where the

newly added camera(s) need to be operational soon after

they are added. In this case, we cannot afford to wait a long

time to obtain significant amount of labeled data for learn-

ing pairwise metrics, thus, we only have limited labeled data

of persons that appear in the entire camera network after ad-

dition of the new camera(s).

Recently published works [25, 26] attempt to address the

problem of on-boarding new cameras to a network by utiliz-

ing old data that were collected in the original camera net-

work, combined with newly collected data in the expanded

network, and source metrics to learn new pairwise met-

rics. They also assume the same set of people in all camera

views, including the new camera (i.e., before and after on-

boarding new cameras) for measuring the view similarity.

However, this is unrealistic in many surveillance scenarios

as source camera data may have been lost or not accessible

due to privacy concerns. Additionally, new people may ap-

pear after the target camera is installed who may or may not

have appeared in existing cameras. Motivated by this obser-

vation, we pose an important question: How can we swiftly

on-board new camera(s) in an existing re-id framework (i)

without having access to the source camera data that the

original network was trained on, and (ii) relying upon only

a small amount of labeled data during the transient phase,

i.e., after adding the new camera(s)?

Transfer learning, which focuses on transferring knowl-

edge from a source to a target domain, has recently been

very successful in various computer vision problems [18,

23, 30, 46, 49]. However, knowledge transfer in our sys-

tem is challenging, because of limited labeled data and ab-

sence of source camera data while on-boarding new cam-

eras. To solve these problems, we develop an efficient

model adaptation approach using hypothesis transfer learn-

ing that aims to transfer the knowledge using only source

models (i.e., learned metrics) and limited labeled data, but

without using any original source camera data. Only a few

labeled identities that are seen by the target camera, and

one or more of the source cameras, are needed for effective

transfer of source knowledge to the newly introduced target

cameras. Henceforth, we will refer to this as target data.

Furthermore, unlike [25, 26], which identify only one best

source camera that aligns maximally with the target camera,

our approach focuses on identifying an optimal weighted

combination of multiple source models for transferring the

knowledge.

Our approach works as follows. Given a set of pairwise

source metrics and limited labeled target data after adding

the new camera(s), we develop an efficient convex opti-

mization formulation based on hypothesis transfer learn-

ing [4, 13] that minimizes the effect of negative transfer

from any outlier source metric while transferring knowl-

edge from source to the target cameras. More specifically,

we learn the weights of different source metrics and the op-

timal matching metric jointly by alternating minimization,

where the weighted source metric is used as a biased reg-

ularizer that aids in learning the optimal target metric only

using limited labeled data. The proposed method, essen-

tially, learns which camera pairs in the existing source net-

work best describe the environment that is covered by the

new camera and one of the existing cameras. Note that our

proposed approach can be easily extended to multiple addi-

tional cameras being introduced at a time in the network or

added sequentially one after another.

1.1. Contributions

We address the problem of swiftly on-boarding new

camera(s) into an existing person re-identification network

without having access to the source camera data, and rely-

ing upon only a small amount of labeled target data in the

transient phase, i.e., after adding the new cameras. Towards

solving the problem, we make the following contributions.

• We propose a robust and efficient multiple metric hy-

pothesis transfer learning algorithm to efficiently adapt

a newly introduced camera to an existing person re-id

framework without having access to the source data.

• We theoretically analyse the properties of our algo-

rithm and show that it minimizes the risk of negative

transfer and performs close to the fully supervised case

even when a small amount of labeled data is available.

• We perform rigorous experiments on multiple bench-

mark datasets to show the effectiveness of our pro-

posed approach over existing alternatives.

2. Related Work

Person Re-identification. Most of the methods in person

re-id are based on supervised learning. These methods ap-

ply extensive training using lots of manually labeled train-

ing data, and can be broadly classified in two categories: (i)

Distance metric learning based [9, 12, 16, 37, 45, 47] (ii)

Deep learning based [1, 28, 33, 40, 44, 52, 53]. Distance

metric learning based methods tend to learn distance met-

rics for camera pairs using pairwise labeled data between

those cameras, whereas end-to-end Deep learning based

methods tend to learn robust feature representations of the

persons, taking into consideration all the labeled data across


all the cameras at once. To overcome the problem of man-

ual labeling, several unsupervised [17, 18, 34, 43, 47, 48]

and semi-supervised [5, 38, 39, 41] methods have been de-

veloped over the past decade. However, these methods do

not consider the case where new cameras are added to an

existing network. The most recent approach in this direc-

tion [25, 26] has considered unsupervised domain adapta-

tion of the target camera by making a strong assumption of

accessibility of the source data. None of these methods have

considered the fact of not having access to the source data

in the dynamic camera network setting. This is relevant, as

source camera data might have been deleted after a while

due to privacy concerns.

Hypothesis Transfer Learning. Hypothesis transfer learn-

ing [4, 13, 19, 24, 42] is a type of transfer learning that uses

only the learned classifiers from a source domain to effi-

ciently learn a classifier in the target domain, which con-

tains only limited labeled data. This approach is practically

appealing as it does not assume any relationship between

source and target distribution, nor the availability of source

data, which may be inaccessible [13]. Most of the literature has dealt with simple linear classifiers for transferring

knowledge [13, 35]. One recent work [27] has addressed the

problem of transferring the knowledge of a source metric,

which is a positive semi-definite matrix, with some provable

guarantees. However, it has been analyzed for only a single

source metric and the weight of the metric is calculated by

minimizing a cost function using sub-gradient descent from

the generalization bound separately, which is a highly non-convex, non-differentiable function. In [35], the method has

addressed transfer of multiple linear classifiers in an SVM

framework, where the corresponding weights are calculated

jointly with the target classifiers in a single optimization.

Unlike these approaches, our approach addresses the case

of transfer from multiple source metrics by jointly optimiz-

ing for target metric, as well as the source weights to reduce

the risk of negative transfer.

3. Methodology

Let us consider a camera network with K cameras for

which we have learned a total of $N = \binom{K}{2}$ pairwise metrics

using extensive labeled data. We wish to install some new

camera(s) in the system that need to be operational soon

after they are added, i.e., without collecting and labeling

lots of new training data. We do not have access to the

old source camera data, rather, we only have the pairwise

source distance metrics. Moreover, we also have access to

only a limited amount of labeled data across the target and

different source cameras, which is collected after installing

the new cameras. Using the source metrics and the limited

pairwise source-target labeled data, we propose to solve a

constrained convex optimization problem (Eq. 1) that aims

to transfer knowledge from the source metrics to the target

efficiently while minimizing the risk of negative transfer.
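The counting behind this setup can be made concrete with a small sketch. The helper names below are hypothetical and only illustrate how many source metrics exist for K cameras and how many new source-target metrics must be learned when cameras are added.

```python
from math import comb


def num_pairwise_metrics(num_cameras: int) -> int:
    """Number of camera pairs in a network, i.e. N = C(K, 2)."""
    return comb(num_cameras, 2)


def num_new_metrics(num_existing: int, num_new: int = 1) -> int:
    """Metrics to learn after on-boarding: every pair that involves
    at least one newly added camera."""
    return comb(num_existing + num_new, 2) - comb(num_existing, 2)


if __name__ == "__main__":
    # Three existing cameras (as in Figure 1) and one new camera.
    print(num_pairwise_metrics(3), num_new_metrics(3, 1))
```

For the three-camera network of Figure 1, this gives three source metrics and three new metrics (M41, M42 and M43) for a single added camera.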

Formulation. Suppose we have access to the optimal

distance metric $M_{ab} \in \mathbb{R}^{d \times d}$ for the a-th and b-th camera pair of an existing re-id network, where d is the dimension of the feature representation of the person images and $a, b \in \{1, 2, \ldots, K\}$. We also have limited pairwise labeled data $\{(x_{ij}, y_{ij})\}_{i=1}^{C}$ between the target camera τ and the source camera s, where $x_{ij} = (x_i - x_j)$ is the feature difference between image i in camera τ and image j in camera s, $C = \binom{n_{\tau s}}{2}$, where $n_{\tau s}$ is the total number of ordered pair images across cameras τ and s, and $y_{ij} \in \{-1, 1\}$. Here $y_{ij} = 1$ if the persons i and j are the same person across the cameras, and $-1$ otherwise. Note that our approach does not

need the presence of every person seen in the new target

camera across all the source cameras; rather, it is enough for

some people in the target camera to be seen in at least one

of the source cameras, in order to compute the new distance

metric across source-target pairs. Let S and D be defined

as S = {(i, j) | yij = 1} and D = {(i, j) | yij = −1}.

Our main goal is to learn the optimal metric between target

and each of the source cameras by using the information

from all the pairwise source metrics $\{M_j\}_{j=1}^{N}$ and limited labeled data $\{(x_{ij}, y_{ij})\}_{i=1}^{C}$. In standard metric learning context, the distance between two feature vectors $x_i \in \mathbb{R}^d$ and $x_j \in \mathbb{R}^d$ with respect to a metric $M \in \mathbb{R}^{d \times d}$ is calculated by $\sqrt{(x_i - x_j)^\top M (x_i - x_j)}$. Thus, we formulate the following optimization problem

for calculating the optimal metric Mτs between target cam-

era τ and the s-th source camera, with ns and nd number of

similar and dissimilar pairs, as follows:

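$$
\begin{aligned}
\underset{M_{\tau s},\, \beta}{\text{minimize}} \quad & \frac{1}{n_s} \sum_{(i,j) \in S} x_{ij}^\top M_{\tau s} x_{ij} + \lambda \Big\| M_{\tau s} - \sum_{j=1}^{N} \beta_j M_j \Big\|_F^2 \\
\text{subject to} \quad & \frac{1}{n_d} \sum_{(i,j) \in D} \big( x_{ij}^\top M_{\tau s} x_{ij} \big) - b \ge 0, \quad M_{\tau s} \succeq 0, \\
& \beta \ge 0, \quad \|\beta\|_2 \le 1
\end{aligned}
\tag{1}
$$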

The above objective consists of two main terms. The first

term is the normalized sum of distances of all similar pair

of features between camera τ and s with respect to the

Mahalanobis metric Mτs, and the second term represents

the Frobenius norm of the difference of Mτs and weighted

combination of source metrics squared. λ is a regularization

parameter to balance the two terms. Note that the second

term in Eq. 1 is essentially related to hypothesis transfer

learning [4, 13] where the hypotheses are the source met-

rics. The first constraint represents that the normalized sum

of distances of all dissimilar pairs of features with respect to

Mτs is at least a user-defined threshold b, and the second constrains the distance metrics to always lie in the positive semi-definite cone. While the third constraint keeps all

the elements of the source weight vector non-negative, the

last constraint ensures that the weights should not deviate

much from zero (through upper-bounding the ℓ-2 norm by

1).
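To make the terms of Eq. 1 concrete, the following is a minimal NumPy sketch that evaluates the objective and checks the four constraints for a given pair (M, β). The function names, the array layout (rows of `X_sim` and `X_dis` hold the feature differences of similar and dissimilar pairs, `sources` stacks the N source metrics) and the numerical tolerance are assumptions made for illustration.

```python
import numpy as np


def mahalanobis(x_i, x_j, M):
    """Distance sqrt((x_i - x_j)^T M (x_i - x_j)) under metric M."""
    diff = x_i - x_j
    return float(np.sqrt(diff @ M @ diff))


def objective(M, beta, sources, X_sim, lam):
    """Objective of Eq. 1: mean x^T M x over similar pairs plus
    lam * ||M - sum_j beta_j M_j||_F^2."""
    data_term = np.mean(np.einsum("nd,de,ne->n", X_sim, M, X_sim))
    M_src = np.tensordot(beta, sources, axes=1)  # weighted source metric
    return float(data_term + lam * np.linalg.norm(M - M_src, "fro") ** 2)


def is_feasible(M, beta, X_dis, b, tol=1e-8):
    """Check the constraints of Eq. 1 (tol is an assumed tolerance)."""
    dis_mean = np.mean(np.einsum("nd,de,ne->n", X_dis, M, X_dis))
    psd = np.linalg.eigvalsh((M + M.T) / 2).min() >= -tol
    weights_ok = np.all(beta >= -tol) and np.linalg.norm(beta) <= 1 + tol
    return bool(dis_mean - b >= -tol and psd and weights_ok)
```

When M equals the weighted source metric, the regularizer vanishes and the objective reduces to the average squared distance over similar pairs.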

Notation. We use the following notations in the optimiza-

tion steps.

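(a) $C_1 = \{ M \in \mathbb{R}^{d \times d} \mid \frac{1}{n_d} \sum_{(i,j) \in D} (x_{ij}^\top M x_{ij}) - b \ge 0 \}$

(b) $C_2 = \{ M \in \mathbb{R}^{d \times d} \mid M \succeq 0 \}$

(c) $C_3 = \{ \beta \in \mathbb{R}^{N} \mid \beta \ge 0,\ \|\beta\|_2 \le 1 \}$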

Optimization. The proposed optimization problem (1) is

not jointly convex over Mτs and β. To solve this nonconvex

optimization over large size matrices, we devise an iterative

algorithm to efficiently solve (1) by alternately solving for

two sub-problems. For the sake of brevity, we denote Mτs

as M in the subsequent steps. Specifically, in the first step,

we fix the weight β and take a gradient step with respect to

M in the descent direction with step size α (Eq. 2). Then,

we project the updated M onto C1 and C2 in an alternating

fashion until convergence (Eq. 3 and Eq. 4). In the next

step, we fix the updated M and take a step with size

γ towards the direction of negative gradient with respect

to β (Eq. 6). In the last step, we simply project β onto

the set C3 (Eq. 7). Algorithm 1 summarizes the alternating

minimization procedure to optimize (1). We briefly describe

these steps below and refer the reader to the supplementary

material for more mathematical details.

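Algorithm 1: Algorithm to Solve Eq. 1

Input: Source metrics $\{M_j\}_{j=1}^{N}$, labeled target data $\{(x_{ij}, y_{ij})\}_{i=1}^{C}$

Output: Optimal metric $M^\star$

Initialization: $M^k$, $\beta^k$, $k = 0$;

while not converged do

    $M^{k+1} = M^k - \alpha \nabla_M f(M, \beta^k)|_{M = M^k}$ (Eq. 2);

    while not converged do

        $M^{k+1} = \Pi_{C_1}(M^{k+1})$ (Eq. 3);

        $M^{k+1} = \Pi_{C_2}(M^{k+1})$ (Eq. 4);

    end

    $\beta^{k+1} = \beta^k - \gamma \nabla_\beta f(M^{k+1}, \beta)|_{\beta = \beta^k}$ (Eq. 6);

    $\beta^{k+1} = \Pi_{C_3}(\beta^{k+1})$ (Eq. 7);

    $k = k + 1$;

end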
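Algorithm 1 can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the identity and uniform initializations, the fixed step sizes, the iteration caps and the stopping tolerance are all choices made here for illustration. Rows of `X_sim` and `X_dis` hold the feature differences of similar and dissimilar pairs, and `sources` has shape (N, d, d).

```python
import numpy as np


def project_c1(M, sigma_d, b):
    """Eq. 3: projection onto the half-space <Sigma_D, M> >= b."""
    gap = b - np.sum(sigma_d * M)
    return M + max(0.0, gap / np.sum(sigma_d * sigma_d)) * sigma_d


def project_c2(M):
    """Eq. 4: projection onto the PSD cone (clip negative eigenvalues)."""
    w, V = np.linalg.eigh((M + M.T) / 2)
    return (V * np.maximum(w, 0.0)) @ V.T


def project_c3(beta):
    """Eq. 7 as written: rescale by max(1, ||beta||_2), then clip negatives."""
    return np.maximum(0.0, beta / max(1.0, np.linalg.norm(beta)))


def hypothesis_transfer(sources, X_sim, X_dis, b=1.0, lam=1.0, alpha=0.05,
                        gamma=0.05, outer_iters=200, inner_iters=50, tol=1e-8):
    """Alternating minimization for Eq. 1 (sketch of Algorithm 1)."""
    N, d, _ = sources.shape
    sigma_s = X_sim.T @ X_sim / len(X_sim)
    sigma_d = X_dis.T @ X_dis / len(X_dis)
    M = np.eye(d)               # assumed initialization
    beta = np.full(N, 1.0 / N)  # assumed initialization (uniform weights)
    for _ in range(outer_iters):
        M_prev, beta_prev = M, beta
        # Gradient step on M with beta fixed (Eq. 2).
        M_src = np.tensordot(beta, sources, axes=1)
        M = M - alpha * (sigma_s + 2 * lam * (M - M_src))
        # Alternate the projections onto C1 and C2 until M stops changing.
        for _ in range(inner_iters):
            M_next = project_c2(project_c1(M, sigma_d, b))
            done = np.linalg.norm(M_next - M) < tol
            M = M_next
            if done:
                break
        # Gradient step on beta with M fixed (Eqs. 5-6), then project (Eq. 7).
        residual = M - np.tensordot(beta, sources, axes=1)
        grad_beta = -2 * lam * np.einsum("nij,ij->n", sources, residual)
        beta = project_c3(beta - gamma * grad_beta)
        if np.linalg.norm(M - M_prev) + np.linalg.norm(beta - beta_prev) < tol:
            break
    return M, beta
```

The β-gradient is written in the compact form $-2\lambda\,\mathrm{trace}(M_i^\top (M - \sum_j \beta_j M_j))$, which is algebraically the same as Eq. 5. In practice the step sizes would need tuning to the scale of the features and source metrics.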

Step 1: Gradient w.r.t M with fixed β.

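With k being the iteration number and $M^k$, $\beta^k$ being M and β in the k-th iteration, we compute the gradient of the objective function (1) with respect to M by fixing $\beta = \beta^k$ at the k-th iteration as follows:

$$\nabla_M f(M, \beta^k)\big|_{M = M^k} = \Sigma_S + 2\lambda \Big( M^k - \sum_{j=1}^{N} \beta_j^k M_j \Big), \tag{2}$$

where $\Sigma_S = \frac{1}{n_s} \sum_{(i,j) \in S} x_{ij} x_{ij}^\top$ and $f(M, \beta^k) = \frac{1}{n_s} \sum_{(i,j) \in S} x_{ij}^\top M x_{ij} + \lambda \big\| M - \sum_{j=1}^{N} \beta_j^k M_j \big\|_F^2$.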

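Step 2: Projection of M onto C1 and C2. The projection of M onto C1 (denoted as $\Pi_{C_1}(M)$) can be computed by solving a constrained optimization as follows:

$$
\begin{aligned}
\Pi_{C_1}(M) = \underset{\tilde{M}}{\arg\min} \quad & \frac{1}{2} \| \tilde{M} - M \|_F^2 \\
\text{subject to} \quad & \frac{1}{n_d} \sum_{(i,j) \in D} \big( x_{ij}^\top \tilde{M} x_{ij} \big) - b \ge 0
\end{aligned}
$$

By writing the Lagrangian for the above constrained optimization and using the KKT conditions with strong duality, the projection of M onto C1 can be written as

$$\Pi_{C_1}(M) = M + \max\left\{ 0, \frac{b - \frac{1}{n_d} \sum_{(i,j) \in D} x_{ij}^\top M x_{ij}}{\| \Sigma_D \|_F^2} \right\} \Sigma_D, \tag{3}$$

where $\Sigma_D = \frac{1}{n_d} \sum_{(i,j) \in D} x_{ij} x_{ij}^\top$. Similarly, using eigenvalue decomposition, the projection of M onto C2 can be written as

$$\Pi_{C_2}(M) = V \, \mathrm{diag}\big( [\hat{\lambda}_1 \ \hat{\lambda}_2 \ \ldots \ \hat{\lambda}_d] \big) V^\top, \tag{4}$$

where V is the eigenvector matrix of M, $\lambda_i$ is the i-th eigenvalue of M and $\hat{\lambda}_j = \max\{\lambda_j, 0\} \ \forall\, j \in [1 \ldots d]$.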

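Step 3: Gradient w.r.t β with fixed M. By fixing $M = M^{k+1}$ in the objective function and differentiating it w.r.t $\beta_i$, the i-th element of β, at the point $\beta = \beta^k$, we get

$$\nabla_{\beta_i} f(M^{k+1}, \beta)\big|_{\beta_i = \beta_i^k} = 2\lambda \beta_i^k \, \mathrm{trace}(M_i^\top M_i) - 2\lambda \, \mathrm{trace}\Big( M_i^\top \Big( M^{k+1} - \sum_{j=1, j \ne i}^{N} \beta_j^k M_j \Big) \Big) \tag{5}$$

By denoting $\nabla_{\beta_i} f(M^{k+1}, \beta)\big|_{\beta_i = \beta_i^k}$ as $a_i^k$, we get

$$\nabla_\beta f(M^{k+1}, \beta)\big|_{\beta = \beta^k} = [a_1^k \ a_2^k \ \ldots \ a_N^k]^\top \tag{6}$$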
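The two gradient expressions (Eq. 2 and Eqs. 5-6) can be sanity-checked against central finite differences of the objective. The sketch below is only a verification aid; the function names, the random test data and the step `eps` are assumptions.

```python
import numpy as np


def f(M, beta, sources, X_sim, lam):
    """Objective of Eq. 1 without the constraints."""
    M_src = np.tensordot(beta, sources, axes=1)
    data_term = np.mean(np.einsum("nd,de,ne->n", X_sim, M, X_sim))
    return data_term + lam * np.sum((M - M_src) ** 2)


def grad_M(M, beta, sources, X_sim, lam):
    """Eq. 2: Sigma_S + 2 * lam * (M - sum_j beta_j M_j)."""
    sigma_s = X_sim.T @ X_sim / len(X_sim)
    return sigma_s + 2 * lam * (M - np.tensordot(beta, sources, axes=1))


def grad_beta(M, beta, sources, lam):
    """Eqs. 5-6, written term by term as in Eq. 5."""
    a = np.empty(len(beta))
    for i, M_i in enumerate(sources):
        # Weighted sum of all source metrics except the i-th one.
        others = np.tensordot(beta, sources, axes=1) - beta[i] * M_i
        a[i] = (2 * lam * beta[i] * np.trace(M_i.T @ M_i)
                - 2 * lam * np.trace(M_i.T @ (M - others)))
    return a


def finite_difference(fun, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function of an array."""
    g = np.zeros_like(x, dtype=float)
    for idx in np.ndindex(x.shape):
        step = np.zeros_like(x, dtype=float)
        step[idx] = eps
        g[idx] = (fun(x + step) - fun(x - step)) / (2 * eps)
    return g
```

Because the objective is quadratic in both M and β, the finite-difference estimate should agree with the closed forms up to rounding error.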

Step 4: Projection of β onto C3. This step essentially

projects a vector to the first quadrant of an N -dimensional

unit norm hyper-sphere. The closed form expression of the

projection onto C3 is as follows:

$$\Pi_{C_3}(\beta^{k+1}) = \max\left\{ 0, \frac{\beta^{k+1}}{\max\{1, \|\beta^{k+1}\|_2\}} \right\} \tag{7}$$
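A short sketch of Eq. 7 follows, together with an alternative ordering for comparison. The first function implements Eq. 7 exactly as written (rescale, then clip). The second is an assumption added here and not taken from the paper: it clips first and rescales afterwards, which is the usual way to compute the Euclidean projection onto the intersection of the non-negative orthant and the unit ball. The two agree whenever the input has no negative entries.

```python
import numpy as np


def project_c3_eq7(beta):
    """Eq. 7 as written: divide by max(1, ||beta||_2), then clip negatives."""
    beta = np.asarray(beta, dtype=float)
    return np.maximum(0.0, beta / max(1.0, np.linalg.norm(beta)))


def project_c3_clip_first(beta):
    """Alternative ordering (assumption, not from the paper): clip negatives,
    then rescale the clipped vector into the unit l2 ball."""
    clipped = np.maximum(0.0, np.asarray(beta, dtype=float))
    return clipped / max(1.0, np.linalg.norm(clipped))


if __name__ == "__main__":
    for vec in ([3.0, 4.0], [2.0, -2.0]):
        print(vec, project_c3_eq7(vec), project_c3_clip_first(vec))
```

Both variants return a point in C3; they differ only in which feasible point is returned when some weights become negative after the gradient step.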


4. Discussion and Analysis

One of the key differences between our approach and

existing methods is that the nature of our problem deals

with the multiple metric setting within the hypothesis trans-

fer learning framework. In this section, following [27], we

theoretically analyze the properties of our Algorithm 1 for

transferring knowledge from multiple metrics.

Let $\mathcal{T}$ be a domain defined over the set $(\mathcal{X} \times \mathcal{Y})$, where $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} = \{-1, 1\}$ denote the feature and label set, respectively, with a probability distribution denoted by $\mathcal{D}_T$. Let T be the target domain defined by $\{(x_i, y_i)\}_{i=1}^{n}$, consisting of n i.i.d. samples, each drawn from the distribution $\mathcal{D}_T$. The optimization proposed in Eq. 1 of [27] (page 2) is defined as:

$$\underset{M \succeq 0}{\text{minimize}} \quad L_T(M) + \lambda \| M - M_S \|_F^2 \tag{8}$$

Fixing the value of β in our proposed optimization (1), we have an optimization problem equivalent to (8), where $M_S = \sum_{j=1}^{N} \beta_j M_j$ and

$$L_T(M) = \frac{1}{n_s} \sum_{(i,j) \in S} x_{ij}^\top M x_{ij} + \mu^\star \Big( b - \frac{1}{n_d} \sum_{(i,j) \in D} x_{ij}^\top M x_{ij} \Big) \tag{9}$$

Note that $\mu^\star$ in Eq. 9 is the optimal dual variable for the inequality constraint of optimization (1) with the weight vector fixed. Clearly, the expression is linear, hence convex, in M, and has a finite Lipschitz constant k.

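Theorem 1. For the convex and k-Lipschitz loss (shown in the supplementary material) defined in (9), the average bound can be expressed as

$$\mathbb{E}_{T \sim \mathcal{D}_T^n} \big[ L_{\mathcal{D}_T}(M^\star) \big] \le L_{\mathcal{D}_T}(M_S) + \frac{8k^2}{\lambda n}, \tag{10}$$

where n is the number of target labeled examples, $M^\star$ is the optimal metric computed from Algorithm 1, $M_S$ is the average of all source metrics defined as $\frac{1}{N} \sum_{j=1}^{N} M_j$, $\mathbb{E}_{T \sim \mathcal{D}_T^n} [ L_{\mathcal{D}_T}(M^\star) ]$ is the expected loss of $M^\star$ computed over distribution $\mathcal{D}_T$, and $L_{\mathcal{D}_T}(M_S)$ is the loss of the average of source metrics computed over $\mathcal{D}_T$.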

Proof. The proof is given in supplementary material.

Implication of Theorem 1: Since we transfer knowledge

from multiple source metrics, and do not know which is

the most generalizable over the target distribution (i.e., the

best source metric), the most sensible thing is to check

for the average performance of using each of the source

metrics directly over the target test data. This is equivalent to giving all the source metrics equal weights and not using any of the target data for training. The bound in Theorem 1 shows that, on average, the metric learned from Algorithm 1 tends to do better than, or in the worst case, at least as well as the average of source metrics, with a fast convergence rate of $O(\frac{1}{n})$ with a limited number of target samples [27].
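As a small worked illustration of the $O(\frac{1}{n})$ rate, the sketch below evaluates the additive term $8k^2 / (\lambda n)$ of Eq. 10 for a few sample sizes. The function name and the values of k, λ and n are purely illustrative assumptions, not values from the experiments.

```python
def transfer_bound_gap(k: float, lam: float, n: int) -> float:
    """Additive term of Eq. 10: 8 * k^2 / (lam * n)."""
    return 8.0 * k ** 2 / (lam * n)


if __name__ == "__main__":
    # Illustrative constants only; doubling n halves the term.
    for n in (10, 20, 40, 80):
        print(n, transfer_bound_gap(k=1.0, lam=1.0, n=n))
```

The term shrinks in inverse proportion to both the number of labeled target samples n and the regularization weight λ.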

Theorem 2. With probability $(1 - \delta)$, for any metric M learned from Algorithm 1 we have

$$L_{\mathcal{D}_T}(M) \le L_T(M) + O\Big( \frac{1}{n} \Big) + \left( \sqrt{\frac{L_T\big( \sum_{j=1}^{N} \beta_j M_j \big)}{\lambda}} + \Big\| \sum_{j=1}^{N} \beta_j M_j \Big\|_F \right) \sqrt{\frac{\ln(2/\delta)}{2n}}, \tag{11}$$

where $L_{\mathcal{D}_T}(M)$ is the loss over the original target distribution (true risk), $L_T(M)$ is the loss over the existing target data (empirical risk), and n is the number of target samples.

Proof. See the supplementary material for the proof.

Implication of Theorem 2: This bound shows that given only a small amount of labeled target data, our method performs close to the fully supervised case. The right hand side of the inequality (11) consists of the term $O(\frac{1}{n}) + \Phi(\beta)\, O(\frac{1}{\sqrt{n}})$. Since the optimal weight $\beta^\star$ from optimization (1) will be sparse due to the way β is constrained, zero weights will automatically be assigned to the outlier metrics, i.e., outlier $M_j$s, resulting in zero values for the terms $\beta_j^\star L_T(M_j)$ corresponding to those indices j and hence a smaller value of $\Phi(\beta)$. As a result, the $O(\frac{1}{\sqrt{n}})$ term will be less dominant in (11) than $O(\frac{1}{n})$, due to the smaller associated coefficient $\Phi(\beta^\star)$ and, hence, can be ignored. Thus, due to the faster decay rate of $O(\frac{1}{n})$, this implies that with very limited target data, the empirical risk will converge to the true risk. Furthermore, when n is very large (the fully supervised case), $O(\frac{1}{\sqrt{n}})$ will be close to zero and cannot be altered by multiplication with any coefficient. This implies that the source metrics will not have any effect on learning when there is enough labeled target data available and are only useful in the presence of limited data, as in our application domain.

Negative Transfer: In optimization (1), we jointly estimate

the optimal metric, as well as the weight vector, which de-

termines which source to transfer from and with how much

weight. If a source metric is not a good representative of the

target distribution, for an optimal λ, the weight associated

to this metric will automatically be set to zero or close to

zero by optimization (1), due to the sparsity constraint of β.

Hence, our approach minimizes the risk of negative transfer.

5. Experiments

Datasets. We test the effectiveness of our method by ex-

perimenting on four publicly available person re-id datasets

such as WARD [20], RAiD [2], Market1501 [50], and

[Figure 2 plots: CMC curves (recognition rate [%] versus rank, rank axis up to 20), averaged across camera pairs with all cameras as target. nAUC values from the legends: (a) WARD: Ours 94.31, Adapt-GFK 90.32, Avg-Source 87.26, CAMEL 75.84, Best-GFK 82.39, Direct-GFK 81.73. (b) RAiD: Ours 88.67, Avg-Source 81.83, CAMEL 71.43 (remaining legend entries not recoverable). (c) Market1501: Ours 93, Adapt-GFK 88, Avg-Source 90, CAMEL 86, Best-GFK 78, Direct-GFK 81. (d) MSMT17: Ours 58, Adapt-GFK 52, Avg-Source 51, CAMEL 50, Best-GFK 46, Direct-GFK 49.]

Figure 2: CMC curves averaged over all target camera combinations, introduced one at a time. (a) WARD with 3 cameras,

(b) RAiD with 4 cameras, (c) Market1501 with 6 cameras and (d) MSMT17 with 15 cameras. Best viewed in color.

MSMT17 [36]. There are several other re-id datasets like

ViPeR [8], PRID2011 [11] and CUHK01 [14]; however,

those do not apply in our case due to availability of only

two cameras. RAiD and WARD are smaller datasets with

43 and 70 persons captured in 4 and 3 cameras, respec-

tively, whereas Market1501 and MSMT17 are more recent

and large datasets with 1,501 and 4,101 persons captured

across 6 and 15 cameras, respectively.

Feature Extraction and Matching. We use Local Maximal Occurrence (LOMO) feature [16] of length 29,960 in

RAiD and WARD datasets. However, since LOMO usually

performs poorly on large datasets [7], for Market1501 and

MSMT17 we extract features from the last layer of an Im-

agenet [3] pre-trained ResNet50 network [10] (denoted as

IDE features in our work). We follow standard PCA tech-

nique to reduce the feature dimension to 100, as in [12, 25].
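The PCA step can be sketched with a plain SVD. This is a generic PCA sketch, assuming the projection is fitted on training features and then applied to all features; the function names are hypothetical and the paper does not specify these details beyond the target dimension of 100.

```python
import numpy as np


def fit_pca(X, n_components):
    """Fit PCA on the rows of X; returns the mean and a (d, n_components)
    matrix whose columns are the leading principal directions."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T


def apply_pca(X, mean, components):
    """Project features onto the fitted principal directions."""
    return (X - mean) @ components


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.normal(size=(500, 2048))  # stand-in for extracted features
    mean, comps = fit_pca(train, n_components=100)
    print(apply_pca(train, mean, comps).shape)
```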

Performance Measures. We provide standard Cumulative

Matching Curves (CMC) and normalized Area Under Curve

(nAUC), as is common in person re-id [2, 12, 16, 26]. While

the former shows accumulated accuracy by considering the

k-most similar matches within a ranked list, the latter is a

measure of re-id accuracy, independent of the number of

test samples. Due to the space constraint, we only report

average CMC curves for most experiments and leave the

full CMC curves in the supplementary material.
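The two measures can be sketched as follows, given a query-by-gallery distance matrix. The function names are hypothetical, and the nAUC here is taken as the mean of the CMC values over the evaluated ranks, which is one common convention and an assumption about the exact definition used.

```python
import numpy as np


def cmc_curve(dist, query_ids, gallery_ids, max_rank=20):
    """Cumulative matching curve in percent.

    dist[q, g] is the distance between query q and gallery item g.
    A query counts as matched at rank r if a gallery item with the same id
    appears among its r closest gallery items.
    """
    order = np.argsort(dist, axis=1)                  # closest first
    matches = gallery_ids[order] == query_ids[:, None]
    # Position of the first correct match; queries with no match never hit.
    first_hit = np.where(matches.any(axis=1), matches.argmax(axis=1),
                         dist.shape[1])
    return np.array([(first_hit <= r).mean() for r in range(max_rank)]) * 100


def nauc(cmc_percent):
    """Normalized area under the CMC curve (assumed: mean over ranks)."""
    return float(np.mean(cmc_percent))
```

The first entry of the returned curve is the Rank-1 accuracy reported in the experiments.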

Experimental Settings. For RAiD we follow the protocol

in [16] and randomly split the persons into a training set of

21 persons and a test set of 20 persons, whereas for WARD,

we randomly split the 70 persons into a set of 35 persons for

training and the remaining 35 persons for testing. For both datasets,

we perform 10 train/test splits and average accuracy across

all splits. We use the standard training and testing splits for

both Market1501 and MSMT17 datasets. During testing,

we follow a multi-query approach by averaging all query

features of each id in the target camera and compare with

all features in the source camera [50].
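The multi-query protocol can be sketched as below: query features of the same identity are averaged, and each pooled query is compared with every gallery feature under a learned metric M. The function name and array layout are assumptions made for illustration.

```python
import numpy as np


def multi_query_distances(query_feats, query_ids, gallery_feats, M):
    """Average the query features of each id, then compute the distance
    sqrt((q - g)^T M (q - g)) to every gallery feature.

    Returns the sorted unique query ids and a (num_ids, num_gallery) matrix.
    """
    ids = np.unique(query_ids)
    pooled = np.stack([query_feats[query_ids == pid].mean(axis=0)
                       for pid in ids])
    diff = pooled[:, None, :] - gallery_feats[None, :, :]  # (Q, G, d)
    sq = np.einsum("qgd,de,qge->qg", diff, M, diff)
    # Guard against tiny negative values from rounding before the root.
    return ids, np.sqrt(np.maximum(sq, 0.0))
```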

Compared Methods. We compare our approach with the

following methods. (1) Two variants of Geodesic Flow Kernel (GFK) [6]: Direct-GFK, where the kernel between a source-target camera pair is directly used to evaluate the accuracy, and Best-GFK, where GFK between the

best source camera and the target camera is used to eval-

uate accuracy between all source-target camera pairs as in

[25, 26]. Both methods use the supervised dimensionality

reduction method, Partial Least Squares (PLS), to project

features into a low dimensional subspace [25, 26]. (2) State-

of-the-art method for on-boarding new cameras [25, 26] that

uses transitive inference over the learned GFK across the

best source and target camera (Adapt-GFK). (3) Clustering-

based Asymmetric MEtric Learning (CAMEL) method of

[47], which projects features from source and target cam-

era to a shared space using a learned projection matrix. For

all compared methods, we use their publicly available code

and perform evaluation in our setting.

5.1. Onboarding a Single New Camera

We consider one camera as newly introduced target cam-

era and all the other as source cameras. We consider all the

possible combinations for conducting experiments. In addi-

tion to the baselines described above, we compare against

the accuracy of average of the source metrics (Avg-Source)

by applying it directly over the target test set to empirically check Theorem 1. We also compute the GFK kernels in two settings: by considering only target data available after

introducing the new cameras (Figure 2) and by considering

the presence of both old source data and the new labeled

data after camera installation as in [25, 26] (Figure 3).

Implementation details. We split training data into disjoint

source and target data considering the fact that the persons

that appear in the new camera after installation may or may

not be seen before in the source cameras. That is, for Mar-

ket1501 and MSMT17, we split the training data into 90%

of persons that are only seen by the source cameras and

10% that are seen in both source cameras and the new tar-

get camera after the installation. Since there are much fewer

persons in RAiD and WARD training set, we split the per-

sons into 80% source and 20% target for those two datasets.

For each dataset, we evaluate every source-target pair and

average accuracy across all pairs. Furthermore, we average

accuracy across all cameras as target. Note that the train

[Figure 3 plot: CMC curves (recognition rate [%] versus rank, rank axis up to 20) on the WARD dataset. Recoverable legend entry: Ours (nAUC: 94.31).]

Figure 3: CMC curves averaged over all the target cam-

era combinations, introduced one at a time, on the WARD

dataset. Note that both old and new source data are used for

calculation of GFK. Best viewed in color.

and test set are kept disjoint in all our experiments.
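The identity-level split described above can be sketched as follows. The function name, the seeded shuffle and the rounding rule are assumptions; the paper only specifies the 90%/10% and 80%/20% proportions.

```python
import random


def split_identities(person_ids, target_fraction, seed=0):
    """Split person ids into disjoint source-only and target sets.

    target_fraction is the share of identities that are also seen by the
    new target camera (e.g. 0.1 for Market1501 and MSMT17, 0.2 for RAiD
    and WARD).
    """
    ids = sorted(set(person_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    # Keep at least one target identity (assumed rounding rule).
    n_target = max(1, round(target_fraction * len(ids)))
    source_ids = set(ids[n_target:])
    target_ids = set(ids[:n_target])
    return source_ids, target_ids


if __name__ == "__main__":
    source_ids, target_ids = split_identities(range(100), 0.1)
    print(len(source_ids), len(target_ids))
```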

Results. Figures 2 and 3 show the results. In all cases, our

method outperforms all the compared methods. The most

competitive methods are those of Adapt-GFK and Avg-

Source that also use source metrics. For the remaining

methods, we see the limitation of only using limited tar-

get data to compute the new metrics. For Market1501, we

see that Avg-Source outperforms the Adapt-GFK baseline

indicating the advantage of knowledge transfer from multiple source metrics compared to one single best source metric as in [25, 26]. However, our approach still outperforms

the Avg-Source baseline by a margin of 20.60%, 13.81%,

2.01% and 1.07% in Rank-1 accuracy on RAiD, WARD,

Market1501 and MSMT17, respectively, validating our im-

plications of Theorem 1. Furthermore, we observe that even

without accessing the source training data that was used

for training the network before adding a new camera, our

method outperforms the GFK based methods that use all the

source data in their computations (see Figure 3). To sum-

marize, the experimental results show that our method per-

forms better on both small and large camera networks with

limited supervision, as it is able to adapt multiple source

metrics through reducing negative transfer by dynamically

weighting the source metrics.

5.2. Onboarding Multiple New Cameras

We perform this experiment on Market1501 dataset us-

ing the same strategy as in Section 5.1 and compare our re-

sults with other methods while adding multiple target cam-

eras to the network, either continuously or in parallel.

Parallel On-boarding of Cameras: We randomly select two or three cameras as targets while keeping the remaining ones as sources. All the new target cameras are tested against both source cameras and other target cameras. The results of adding two and three cameras in parallel (at the same time) are shown in Figure 4 (a) and (b), respectively. In both cases, our method outperforms all the compared methods, with an increasing margin as rank increases. We outperform the most competitive method, CAMEL, in Rank-1 accuracy by 5.45% and 3.73% when adding two and three cameras, respectively. Furthermore, our method adapts source metrics better, since it can assign zero weights to the metrics that do not generalize well to the target data. Meanwhile, Adapt-GFK has a high probability of using outlier source metrics when fewer source metrics are available, which causes negative transfer. This can be seen in Figure 4, where the GFK-based methods perform worse than CAMEL, which is computed with limited supervision only, without using any source metrics.
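To illustrate how a zero weight removes a poorly generalizing source metric from the combination, here is a minimal sketch. It only shows the weighted sum of Mahalanobis matrices and the resulting distance; it does not reproduce the optimization that learns the weights. The helper names and the non-negativity check on the weights are assumptions made for this example.

```python
import numpy as np

def combine_source_metrics(source_metrics, beta):
    """Weighted sum of source Mahalanobis matrices: sum_j beta[j] * M_j."""
    beta = np.asarray(beta, dtype=float)
    if np.any(beta < 0):
        raise ValueError("this sketch assumes non-negative weights")
    stacked = np.asarray(source_metrics, dtype=float)   # shape (k, d, d)
    return np.tensordot(beta, stacked, axes=1)          # shape (d, d)

def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff @ M @ diff)

# Two hypothetical source metrics: a well-behaved one and an outlier.
M_good = np.eye(2)
M_outlier = np.diag([100.0, 0.0])

# A zero weight on the outlier leaves only the well-behaved metric.
M_selected = combine_source_metrics([M_good, M_outlier], beta=[1.0, 0.0])
# Uniform averaging (in the spirit of an Avg-Source style baseline) lets
# the outlier dominate the distance along the first feature dimension.
M_average = combine_source_metrics([M_good, M_outlier], beta=[0.5, 0.5])

d_selected = mahalanobis_sq([1.0, 0.0], [0.0, 0.0], M_selected)
d_average = mahalanobis_sq([1.0, 0.0], [0.0, 0.0], M_average)
```

In this toy case the averaged metric inflates the distance along the first dimension by a factor of about fifty, which is the kind of distortion that a learned zero weight avoids.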

Sequential On-boarding of Cameras: For this experiment, we randomly select three target cameras that are added sequentially. A target camera is tested against all source cameras and previously added target cameras. The results are shown in Figure 4 (c). Similar to parallel on-boarding, our method outperforms the compared methods by a large margin. In this setting, we outperform CAMEL by 8.22% in Rank-1 accuracy. Additionally, compared to all GFK-based methods, the Rank-1 margin is kept constant at 10% for both parallel and sequential on-boarding. These results show the scalability of our proposed method while adding multiple cameras to a network, irrespective of whether they are added in parallel or sequentially.

5.3. Different Labeled Data in New Cameras

We perform this experiment to show the implications of Theorem 2 by using different percentages of labeled target data (10%, 20%, 30%, 50%, 75% and 100%) in our method. We compare with the widely used KISS metric learning (KISSME) [12] algorithm and show the difference in Rank-1 accuracy as a function of the amount of labeled target data. Figure 5 (a) shows the results. With only 10% labeled data, the difference between our method and KISSME [12] is almost 30%; however, as we add more labeled data, the gap narrows, and the Rank-1 accuracy of the two methods becomes equivalent at 100% labeled data. This confirms the implications of Theorem 2, where we showed that with increasing labeled target data, the effect of source metrics in learning becomes negligible.
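For reference, the KISSME baseline estimates a Mahalanobis matrix from the covariances of feature differences over similar and dissimilar pairs, M = Σ_S⁻¹ − Σ_D⁻¹, followed by a projection onto the positive semidefinite cone. The sketch below follows that formulation from [12]; the function name and the small `ridge` term (added so the covariances stay invertible when labeled data is scarce) are assumptions of this example, not part of the paper's setup.

```python
import numpy as np

def kissme_metric(diff_similar, diff_dissimilar, ridge=1e-6):
    """KISSME-style metric from pairwise feature differences.

    diff_similar:    (n_s, d) array of x_i - x_j for same-identity pairs
    diff_dissimilar: (n_d, d) array of x_i - x_j for different-identity pairs
    ridge:           assumed regularizer to keep the covariances invertible
    """
    S = np.asarray(diff_similar, dtype=float)
    D = np.asarray(diff_dissimilar, dtype=float)
    d = S.shape[1]

    # Zero-mean covariance estimates of the pairwise differences.
    cov_s = S.T @ S / len(S) + ridge * np.eye(d)
    cov_d = D.T @ D / len(D) + ridge * np.eye(d)

    M = np.linalg.inv(cov_s) - np.linalg.inv(cov_d)
    M = (M + M.T) / 2.0                       # enforce symmetry numerically

    # Project onto the PSD cone by clipping negative eigenvalues.
    w, V = np.linalg.eigh(M)
    return (V * np.clip(w, 0.0, None)) @ V.T

# Synthetic differences: same-identity pairs are close, others are spread out.
rng = np.random.default_rng(0)
diff_s = 0.1 * rng.standard_normal((500, 3))
diff_d = rng.standard_normal((500, 3))
M_kissme = kissme_metric(diff_s, diff_d)
```

With very few labeled pairs the two covariance estimates become noisy, which is consistent with the large gap observed at 10% labeling.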

5.4. Finetuning with Deep Features

This section shows the strength of our method when compared with CNN features extracted from a network trained on the source data (we train a ResNet50 model [10], pretrained on the ImageNet dataset). Without transfer learning, we have two options: (a) directly use the source model to extract features in the target and do matching based on the Euclidean/KISSME metric (IDE), or (b) finetune the source model using limited target data and then extract features to do matching using Euclidean/KISSME (finetuned). We


[Figure 4 plots: CMC curves with Rank (2 to 20) on the x-axis and Recognition rate [%] (0 to 70) on the y-axis. Legend entries recoverable from the extraction: (a) Average across camera pairs when Cameras 4 and 5 are target: Ours (nAUC: 93.17), CAMEL (nAUC: 88.66). (b) Average across camera pairs when Cameras 1, 3 and 4 are target: Ours (nAUC: 93.20), CAMEL (nAUC: 88.86). (c) Average across camera pairs when Cameras 1, 2 and 6 are target: Ours (nAUC: 92.76), CAMEL (nAUC: 84.83).]

Figure 4: CMC curves averaged across target cameras on the Market1501 dataset. (a) and (b) show results while adding two and three cameras in parallel; (c) shows results while adding three cameras sequentially, one after another. Best viewed in color.

[Figure 5 plots: (a) x-axis Percentage labeling [%] (10, 20, 30, 50, 75, 100), y-axis Rank-1 accuracy [%] (0 to 70); curves: Ours, KISSME, Difference. (b) x-axis Percentage labeling [%] (2, 4, 8, 10, 20), y-axis Rank-1 accuracy [%] (0 to 50); curves: Ours (IDE), KISSME (IDE), Euclidean (IDE), Ours (finetuned), KISSME (finetuned), Euclidean (finetuned). (c) x-axis values from 10⁻⁵ to 10⁻¹ (λ, per the caption), y-axis Rank-1 accuracy [%] (16 to 34); curves: 10% labeling, 2% labeling.]

Figure 5: (a) Effect of different percentages of labeled target data on the WARD dataset, illustrating Theorem 2. (b) Analysis of our method with deep features trained on source camera data in the Market1501 dataset with the 6th camera as target. (c) Sensitivity of the Rank-1 performance to λ, tested using deep features in Market1501 with the 6th camera as target. Best viewed in color.

compare these baselines with our method for different percentages of labeled data on the Market1501 dataset, where the pairwise metrics are computed using the source features extracted from the model without any finetuning. We use those source metrics along with the target features, extracted before (Ours(IDE)) and after finetuning the source model (Ours(finetuned)). Please see the supplementary material for more details. Figure 5 (b) shows the results. Ours(IDE) outperforms Euclidean(IDE) by a margin of 10% on Market1501 with 20% of labeled target data. The difference between Ours(finetuned) and Euclidean/KISSME (finetuned) is more noticeable with less labeled data and becomes smaller as the amount of labeled target data increases (Theorem 2). However, Ours(finetuned) consistently outperforms all the other baselines for up to 20% labeling.

5.5. Parameter Sensitivity

We perform this experiment to study the effect of λ in optimization (1) for a given percentage of labeled target data. Figure 5 (c) shows the Rank-1 accuracy of our proposed method for different values of λ. From optimization (1), when λ → ∞ the left term can be neglected, resulting in the optimal M and β being zero. However, when λ → 0, the regularization term is neglected, resulting in no transfer. We can see from Figure 5 (c) that there is an operating zone of λ (e.g., in the range of 10⁻⁴ to 10⁻²) that is neither too high nor too low for useful transfer from source metrics.
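A sensitivity study of this kind amounts to a sweep over a logarithmic grid of λ values. The sketch below shows one way such a sweep could be organized; `sweep_lambda` and the callable `rank1_fn` are hypothetical names, and the toy scoring function merely stands in for training and evaluating the model at a given λ.

```python
import math
import numpy as np

def sweep_lambda(rank1_fn, exponents=range(-5, 0)):
    """Evaluate rank1_fn(lam) for lam = 10**e and return the best lam and all scores.

    rank1_fn is a placeholder for "solve the optimization with this lambda
    and return Rank-1 accuracy on held-out target data".
    """
    lambdas = [10.0 ** e for e in exponents]
    scores = [float(rank1_fn(lam)) for lam in lambdas]
    best = int(np.argmax(scores))
    return lambdas[best], dict(zip(lambdas, scores))

# Toy stand-in for the real evaluation: peaks at lambda = 1e-3 by construction.
def toy_rank1(lam):
    return 30.0 - 4.0 * abs(math.log10(lam) + 3.0)

best_lambda, all_scores = sweep_lambda(toy_rank1)
```

The default grid spans 10⁻⁵ to 10⁻¹, matching the range on the horizontal axis of Figure 5 (c).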

6. Conclusions

We addressed a critically important problem in person re-identification that has received little attention thus far: how to quickly on-board new cameras into an existing camera network. We showed that this can be addressed effectively with hypothesis transfer learning, using only learned source metrics and a limited amount of labeled data collected after installing the new camera(s). We provided theoretical analysis to show that our approach minimizes the effect of negative transfer by finding an optimal weighted combination of multiple source metrics. We showed the effectiveness of our approach on four standard datasets, significantly outperforming several baseline methods.

Acknowledgements. This work was partially supported

by ONR grant N00014-19-1-2264 and NSF grant 1724341.


References

[1] Ejaz Ahmed, Michael Jones, and Tim K Marks. An improved deep learning architecture for person re-identification. In CVPR, pages 3908–3916, 2015.

[2] Abir Das, Anirban Chakraborty, and Amit K Roy-Chowdhury. Consistent re-identification in a camera network. In ECCV, pages 330–345. Springer, 2014.

[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.

[4] Simon S Du, Jayanth Koushik, Aarti Singh, and Barnabas Poczos. Hypothesis transfer learning via transformation functions. In Advances in Neural Information Processing Systems, pages 574–584, 2017.

[5] Hehe Fan, Liang Zheng, Chenggang Yan, and Yi Yang. Unsupervised person re-identification: Clustering and fine-tuning. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 14(4):83, 2018.

[6] Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, pages 2066–2073. IEEE, 2012.

[7] Mengran Gou, Ziyan Wu, Angels Rates-Borras, Octavia Camps, Richard J Radke, et al. A systematic evaluation and benchmark for person re-identification: Features, metrics, and datasets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(3):523–536, 2018.

[8] Douglas Gray and Hai Tao. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In ECCV, pages 262–275. Springer, 2008.

[9] Matthieu Guillaumin, Jakob Verbeek, and Cordelia Schmid. Is that you? Metric learning approaches for face identification. In CVPR, pages 498–505. IEEE, 2009.

[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[11] Martin Hirzer, Csaba Beleznai, Peter M Roth, and Horst Bischof. Person re-identification by descriptive and discriminative classification. In Scandinavian Conference on Image Analysis, pages 91–102. Springer, 2011.

[12] Martin Koestinger, Martin Hirzer, Paul Wohlhart, Peter M Roth, and Horst Bischof. Large scale metric learning from equivalence constraints. In CVPR, pages 2288–2295. IEEE, 2012.

[13] Ilja Kuzborskij and Francesco Orabona. Stability and hypothesis transfer learning. In ICML, pages 942–950, 2013.

[14] Wei Li, Rui Zhao, and Xiaogang Wang. Human reidentification with transferred metric learning. In ACCV, pages 31–44. Springer, 2012.

[15] Wei Li, Xiatian Zhu, and Shaogang Gong. Harmonious attention network for person re-identification. In CVPR, pages 2285–2294, 2018.

[16] Shengcai Liao, Yang Hu, Xiangyu Zhu, and Stan Z. Li. Person re-identification by local maximal occurrence representation and metric learning. In CVPR, pages 2197–2206, June 2015.

[17] Yutian Lin, Xuanyi Dong, Liang Zheng, Yan Yan, and Yi Yang. A bottom-up clustering approach to unsupervised person re-identification. In AAAI, volume 33, pages 8738–8745, 2019.

[18] Jianming Lv, Weihang Chen, Qing Li, and Can Yang. Unsupervised cross-dataset person re-identification by transfer learning of spatial-temporal patterns. In CVPR, pages 7948–7956, 2018.

[19] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation with multiple sources. In Advances in Neural Information Processing Systems, pages 1041–1048, 2009.

[20] Niki Martinel and Christian Micheloni. Re-identify people in wide area camera network. In 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 31–36. IEEE, 2012.

[21] Tetsu Matsukawa, Takahiro Okabe, Einoshin Suzuki, and Yoichi Sato. Hierarchical gaussian descriptor for person re-identification. In CVPR, pages 1363–1372, 2016.

[22] Tetsu Matsukawa, Takahiro Okabe, Einoshin Suzuki, and Yoichi Sato. Hierarchical gaussian descriptors with application to person re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

[23] Hyeonwoo Noh, Taehoon Kim, Jonghwan Mun, and Bohyung Han. Transfer learning via unsupervised task discovery for visual question answering. In CVPR, pages 8385–8394, 2019.

[24] Francesco Orabona, Claudio Castellini, Barbara Caputo, Angelo Emanuele Fiorilla, and Giulio Sandini. Model adaptation with least-squares svm for adaptive hand prosthetics. In 2009 IEEE International Conference on Robotics and Automation, pages 2897–2903. IEEE, 2009.

[25] Rameswar Panda, Amran Bhuiyan, Vittorio Murino, and Amit K Roy-Chowdhury. Unsupervised adaptive re-identification in open world dynamic camera networks. In CVPR, pages 7054–7063, 2017.

[26] Rameswar Panda, Amran Bhuiyan, Vittorio Murino, and Amit K Roy-Chowdhury. Adaptation of person re-identification models for on-boarding new camera(s). Pattern Recognition, 96:106991, 2019.

[27] Michael Perrot and Amaury Habrard. A theoretical analysis of metric hypothesis transfer learning. In ICML, pages 1708–1717, 2015.

[28] Xuelin Qian, Yanwei Fu, Yu-Gang Jiang, Tao Xiang, and Xiangyang Xue. Multi-scale deep learning architectures for person re-identification. In ICCV, pages 5399–5408, 2017.

[29] Amit K Roy-Chowdhury and Bi Song. Camera networks: The acquisition and analysis of videos over wide areas. Synthesis Lectures on Computer Vision, 3(1):1–133, 2012.

[30] Ruoqi Sun, Xinge Zhu, Chongruo Wu, Chen Huang, Jianping Shi, and Lizhuang Ma. Not all areas are equal: Transfer learning for semantic segmentation via hierarchical region selection. In CVPR, pages 4360–4369, 2019.

[31] Yifan Sun, Qin Xu, Yali Li, Chi Zhang, Yikang Li, Shengjin Wang, and Jian Sun. Perceive where to focus: Learning visibility-aware part-level features for partial person re-identification. In CVPR, pages 393–402, 2019.

[32] Chiat-Pin Tay, Sharmili Roy, and Kim-Hui Yap. Aanet: Attribute attention network for person re-identifications. In CVPR, pages 7134–7143, 2019.

[33] Guangcong Wang, Jianhuang Lai, Peigen Huang, and Xiaohua Xie. Spatial-temporal person re-identification. In AAAI, volume 33, pages 8933–8940, 2019.

[34] Jingya Wang, Xiatian Zhu, Shaogang Gong, and Wei Li. Transferable joint attribute-identity deep learning for unsupervised person re-identification. In CVPR, pages 2275–2284, 2018.

[35] Yu-Xiong Wang and Martial Hebert. Learning by transferring from unsupervised universal sources. In AAAI, pages 2187–2193.

[36] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer gan to bridge domain gap for person re-identification. In CVPR, pages 79–88, 2018.

[37] Kilian Q Weinberger and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10(Feb):207–244, 2009.

[38] Ancong Wu, Wei-Shi Zheng, Xiaowei Guo, and Jian-Huang Lai. Distilled person re-identification: Towards a more scalable system. In CVPR, pages 1187–1196, 2019.

[39] Yu Wu, Yutian Lin, Xuanyi Dong, Yan Yan, Wanli Ouyang, and Yi Yang. Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning. In CVPR, pages 5177–5186, 2018.

[40] Tong Xiao, Hongsheng Li, Wanli Ouyang, and Xiaogang Wang. Learning deep feature representations with domain guided dropout for person re-identification. In CVPR, pages 1249–1258, 2016.

[41] Xiaomeng Xin, Jinjun Wang, Ruji Xie, Sanping Zhou, Wenli Huang, and Nanning Zheng. Semi-supervised person re-identification using multi-view clustering. Pattern Recognition, 88:285–297, 2019.

[42] Jun Yang, Rong Yan, and Alexander G Hauptmann. Cross-domain video concept detection using adaptive svms. In ACM MM, pages 188–197. ACM, 2007.

[43] Qize Yang, Hong-Xing Yu, Ancong Wu, and Wei-Shi Zheng. Patch-based discriminative feature learning for unsupervised person re-identification. In CVPR, pages 3633–3642, 2019.

[44] Wenjie Yang, Houjing Huang, Zhang Zhang, Xiaotang Chen, Kaiqi Huang, and Shu Zhang. Towards rich feature discovery with class activation maps augmentation for person re-identification. In CVPR, pages 1389–1398, 2019.

[45] Xun Yang, Meng Wang, and Dacheng Tao. Person re-identification with metric learning using privileged information. IEEE Transactions on Image Processing, 27(2):791–805, 2017.

[46] Xi Yin, Xiang Yu, Kihyuk Sohn, Xiaoming Liu, and Manmohan Chandraker. Feature transfer learning for face recognition with under-represented data. In CVPR, pages 5704–5713, 2019.

[47] Hong-Xing Yu, Ancong Wu, and Wei-Shi Zheng. Cross-view asymmetric metric learning for unsupervised person re-identification. In ICCV, pages 994–1002, 2017.

[48] Hong-Xing Yu, Wei-Shi Zheng, Ancong Wu, Xiaowei Guo, Shaogang Gong, and Jian-Huang Lai. Unsupervised person re-identification by soft multilabel learning. In CVPR, pages 2148–2157, 2019.

[49] Amir R Zamir, Alexander Sax, William Shen, Leonidas J Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In CVPR, pages 3712–3722, 2018.

[50] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In ICCV, pages 1116–1124, 2015.

[51] Liang Zheng, Yi Yang, and Alexander G Hauptmann. Person re-identification: Past, present and future. arXiv preprint arXiv:1610.02984, 2016.

[52] Zhedong Zheng, Xiaodong Yang, Zhiding Yu, Liang Zheng, Yi Yang, and Jan Kautz. Joint discriminative and generative learning for person re-identification. In CVPR, pages 2138–2147, 2019.

[53] Sanping Zhou, Jinjun Wang, Jiayun Wang, Yihong Gong, and Nanning Zheng. Point to set similarity based deep feature learning for person re-identification. In CVPR, pages 3741–3750, 2017.
