camera on-boarding for person re-identification using … · 2020-06-29 · camera on-boarding for...
Camera On-boarding for Person Re-identification using Hypothesis Transfer
Learning
Sk Miraj Ahmed1,∗, Aske R Lejbølle2,∗,†, Rameswar Panda3, Amit K. Roy-Chowdhury1
1 University of California, Riverside, 2 Aalborg University, Denmark, 3 IBM Research AI, Cambridge
{sahme047@, alejboel@, rpand002@, amitrc@ece}.ucr.edu
Abstract
Most of the existing approaches for person re-
identification consider a static setting where the number of
cameras in the network is fixed. An interesting direction,
which has received little attention, is to explore the dynamic
nature of a camera network, where one tries to adapt the ex-
isting re-identification models after on-boarding new cam-
eras, with little additional effort. There have been a few
recent methods proposed in person re-identification that at-
tempt to address this problem by assuming the labeled data
in the existing network is still available while adding new
cameras. This is a strong assumption since there may exist
some privacy issues for which one may not have access to
those data. Rather, based on the fact that it is easy to store
the learned re-identification models, which mitigates any
data privacy concern, we develop an efficient model adapta-
tion approach using hypothesis transfer learning that aims
to transfer the knowledge using only source models and
limited labeled data, but without using any source cam-
era data from the existing network. Our approach mini-
mizes the effect of negative transfer by finding an optimal
weighted combination of multiple source models for trans-
ferring the knowledge. Extensive experiments on four chal-
lenging benchmark datasets with a variable number of cam-
eras well demonstrate the efficacy of our proposed approach
over state-of-the-art methods.
1. Introduction
Person re-identification (re-id), which addresses the
problem of matching people across different cameras, has
attracted intense attention in recent years [7, 29, 51]. Much
progress has been made in developing a variety of methods
to learn features [16, 21, 22] or distance metrics by exploit-
ing unlabeled and/or manually labeled data. Recently, deep
learning methods have also shown significant performance
∗ Equal contribution. † This work was done while AL was a visiting student at UC Riverside.
[Figure 1 diagram: an existing network of cameras C1, C2 and C3, whose source data were used for pairwise training of the metrics M12, M13 and M23; the learned models are kept and the source data discarded, so the pairwise metrics are provided without access to source data; a new camera C4 is on-boarded with new limited pairwise labeled data with the existing cameras (colors indicate the corresponding camera), and the metrics M41, M42 and M43 are unknown.]
Figure 1: Consider a three camera (C1, C2 and C3) net-
work, where we have only three pairwise distance metrics
(M12, M23 and M13) available for matching persons, and
no access to the labeled data due to privacy concerns. A new
camera, C4, needs to be added into the system quickly, thus,
allowing us to have only very limited labeled data across
the new camera and the existing ones. Our goal in this pa-
per is to learn the pairwise distance metrics (M41, M42 and
M43) between the newly inserted camera(s) and the existing
cameras, using the learned source metrics from the existing
network and a small amount of labeled data available after
installing the new camera(s).
improvement on person re-id [1, 15, 31, 32, 44, 52]. How-
ever, with the notable exception of [25, 26], most of these
works have not yet considered the dynamic nature of a cam-
era network, where new cameras can be introduced at any
time to cover a certain related area that is not well-covered
by the existing network of cameras. To build a more scalable person re-identification system, it is essential to
consider the problem of how to on-board new cameras into
an existing network with little additional effort.
Let us consider K cameras in a network for which we have learned $\binom{K}{2}$ optimal pairwise matching metrics, one for each camera pair (see Figure 1
for an illustrative example). However, during an opera-
tional phase of the system, new camera(s) may be temporar-
ily introduced to collect additional information, which ide-
ally should be integrated with minimal effort. Given newly
introduced camera(s), the traditional re-id methods aim to
re-learn the pairwise matching metrics using a costly train-
ing phase. This is impractical in many situations where the
newly added camera(s) need to be operational soon after
they are added. In this case, we cannot afford to wait a long
time to obtain a significant amount of labeled data for learning pairwise metrics; thus, we only have limited labeled data
of persons that appear in the entire camera network after ad-
dition of the new camera(s).
Recently published works [25, 26] attempt to address the
problem of on-boarding new cameras to a network by utiliz-
ing old data that were collected in the original camera net-
work, combined with newly collected data in the expanded
network, and source metrics to learn new pairwise met-
rics. They also assume the same set of people in all camera
views, including the new camera (i.e., before and after on-
boarding new cameras) for measuring the view similarity.
However, this is unrealistic in many surveillance scenarios
as source camera data may have been lost or not accessible
due to privacy concerns. Additionally, new people may ap-
pear after the target camera is installed who may or may not
have appeared in existing cameras. Motivated by this obser-
vation, we pose an important question: How can we swiftly
on-board new camera(s) in an existing re-id framework (i)
without having access to the source camera data that the
original network was trained on, and (ii) relying upon only
a small amount of labeled data during the transient phase,
i.e., after adding the new camera(s)?
Transfer learning, which focuses on transferring knowl-
edge from a source to a target domain, has recently been
very successful in various computer vision problems [18,
23, 30, 46, 49]. However, knowledge transfer in our sys-
tem is challenging, because of limited labeled data and ab-
sence of source camera data while on-boarding new cam-
eras. To solve these problems, we develop an efficient
model adaptation approach using hypothesis transfer learn-
ing that aims to transfer the knowledge using only source
models (i.e., learned metrics) and limited labeled data, but
without using any original source camera data. Only a few
labeled identities that are seen by the target camera, and
one or more of the source cameras, are needed for effective
transfer of source knowledge to the newly introduced target
cameras. Henceforth, we will refer to this as target data.
Furthermore, unlike [25, 26], which identify only one best
source camera that aligns maximally with the target camera,
our approach focuses on identifying an optimal weighted
combination of multiple source models for transferring the
knowledge.
Our approach works as follows. Given a set of pairwise
source metrics and limited labeled target data after adding
the new camera(s), we develop an efficient convex opti-
mization formulation based on hypothesis transfer learn-
ing [4, 13] that minimizes the effect of negative transfer
from any outlier source metric while transferring knowl-
edge from source to the target cameras. More specifically,
we learn the weights of different source metrics and the op-
timal matching metric jointly by alternating minimization,
where the weighted source metric is used as a biased reg-
ularizer that aids in learning the optimal target metric only
using limited labeled data. The proposed method, essen-
tially, learns which camera pairs in the existing source net-
work best describe the environment that is covered by the
new camera and one of the existing cameras. Note that our
proposed approach can be easily extended to multiple addi-
tional cameras being introduced at a time in the network or
added sequentially one after another.
1.1. Contributions
We address the problem of swiftly on-boarding new
camera(s) into an existing person re-identification network
without having access to the source camera data, and rely-
ing upon only a small amount of labeled target data in the
transient phase, i.e., after adding the new cameras. Towards
solving the problem, we make the following contributions.
• We propose a robust and efficient multiple metric hy-
pothesis transfer learning algorithm to efficiently adapt
a newly introduced camera to an existing person re-id
framework without having access to the source data.
• We theoretically analyze the properties of our algorithm and show that it minimizes the risk of negative transfer and performs close to the fully supervised case even when only a small amount of labeled data is available.
• We perform rigorous experiments on multiple bench-
mark datasets to show the effectiveness of our pro-
posed approach over existing alternatives.
2. Related Work
Person Re-identification. Most of the methods in person
re-id are based on supervised learning. These methods ap-
ply extensive training using lots of manually labeled train-
ing data, and can be broadly classified in two categories: (i)
Distance metric learning based [9, 12, 16, 37, 45, 47] (ii)
Deep learning based [1, 28, 33, 40, 44, 52, 53]. Distance
metric learning based methods tend to learn distance met-
rics for camera pairs using pairwise labeled data between
those cameras, whereas end-to-end Deep learning based
methods tend to learn robust feature representations of the
persons, taking into consideration all the labeled data across
all the cameras at once. To overcome the problem of man-
ual labeling, several unsupervised [17, 18, 34, 43, 47, 48]
and semi-supervised [5, 38, 39, 41] methods have been de-
veloped over the past decade. However, these methods do
not consider the case where new cameras are added to an
existing network. The most recent approach in this direc-
tion [25, 26] has considered unsupervised domain adapta-
tion of the target camera by making a strong assumption of
accessibility of the source data. None of these methods have
considered the fact of not having access to the source data
in the dynamic camera network setting. This is relevant, as
source camera data might have been deleted after a while
due to privacy concerns.
Hypothesis Transfer Learning. Hypothesis transfer learn-
ing [4, 13, 19, 24, 42] is a type of transfer learning that uses
only the learned classifiers from a source domain to effi-
ciently learn a classifier in the target domain, which con-
tains only limited labeled data. This approach is practically
appealing as it does not assume any relationship between
source and target distribution, nor the availability of source
data, which may be inaccessible [13]. Most of the literature has dealt with simple linear classifiers for transferring
knowledge [13, 35]. One recent work [27] has addressed the
problem of transferring the knowledge of a source metric,
which is a positive semi-definite matrix, with some provable
guarantees. However, it has been analyzed for only a single
source metric and the weight of the metric is calculated by
minimizing a cost function using sub-gradient descent from
the generalization bound separately, which is a highly non-convex, non-differentiable function. In [35], the method has
addressed transfer of multiple linear classifiers in an SVM
framework, where the corresponding weights are calculated
jointly with the target classifiers in a single optimization.
Unlike these approaches, our approach addresses the case
of transfer from multiple source metrics by jointly optimiz-
ing for target metric, as well as the source weights to reduce
the risk of negative transfer.
3. Methodology
Let us consider a camera network with K cameras for
which we have learned a total of $N = \binom{K}{2}$ pairwise metrics
using extensive labeled data. We wish to install some new
camera(s) in the system that need to be operational soon
after they are added, i.e., without collecting and labeling
lots of new training data. We do not have access to the
old source camera data, rather, we only have the pairwise
source distance metrics. Moreover, we also have access to
only a limited amount of labeled data across the target and
different source cameras, which is collected after installing
the new cameras. Using the source metrics and the limited
pairwise source-target labeled data, we propose to solve a
constrained convex optimization problem (Eq. 1) that aims
to transfer knowledge from the source metrics to the target
efficiently while minimizing the risk of negative transfer.
Formulation. Suppose we have access to the optimal distance metric $M_{ab} \in \mathbb{R}^{d \times d}$ for the $a$-th and $b$-th camera pair of an existing re-id network, where $d$ is the dimension of the feature representation of the person images and $a, b \in \{1, 2, \ldots, K\}$. We also have limited pairwise labeled data $\{(x_{ij}, y_{ij})\}_{i=1}^{C}$ between the target camera $\tau$ and the source camera $s$, where $x_{ij} = (x_i - x_j)$ is the feature difference between image $i$ in camera $\tau$ and image $j$ in camera $s$, $C = \binom{n_{\tau s}}{2}$, where $n_{\tau s}$ is the total number of ordered pair images across cameras $\tau$ and $s$, and $y_{ij} \in \{-1, 1\}$. Here $y_{ij} = 1$ if the persons $i$ and $j$ are the same person across the cameras, and $-1$ otherwise. Note that our approach does not need the presence of every person seen in the new target camera across all the source cameras; rather, it is enough for some people in the target camera to be seen in at least one of the source cameras, in order to compute the new distance metric across source-target pairs. Let $S$ and $D$ be defined as $S = \{(i, j) \mid y_{ij} = 1\}$ and $D = \{(i, j) \mid y_{ij} = -1\}$. Our main goal is to learn the optimal metric between the target and each of the source cameras by using the information from all the pairwise source metrics $\{M_j\}_{j=1}^{N}$ and the limited labeled data $\{(x_{ij}, y_{ij})\}_{i=1}^{C}$. In the standard metric learning context, the distance between two feature vectors $x_i \in \mathbb{R}^d$ and $x_j \in \mathbb{R}^d$ with respect to a metric $M \in \mathbb{R}^{d \times d}$ is calculated by $\sqrt{(x_i - x_j)^\top M (x_i - x_j)}$. Thus, we formulate the following optimization problem for calculating the optimal metric $M_{\tau s}$ between target camera $\tau$ and the $s$-th source camera, with $n_s$ and $n_d$ number of similar and dissimilar pairs, as follows:
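$$
\begin{aligned}
\underset{M_{\tau s},\, \beta}{\text{minimize}} \quad & \frac{1}{n_s} \sum_{(i,j) \in S} x_{ij}^\top M_{\tau s} x_{ij} + \lambda \Big\| M_{\tau s} - \sum_{j=1}^{N} \beta_j M_j \Big\|_F^2 \\
\text{subject to} \quad & \frac{1}{n_d} \sum_{(i,j) \in D} \big( x_{ij}^\top M_{\tau s} x_{ij} \big) - b \ge 0, \quad M_{\tau s} \succeq 0, \\
& \beta \ge 0, \quad \|\beta\|_2 \le 1
\end{aligned}
\tag{1}
$$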
The above objective consists of two main terms. The first term is the normalized sum of distances of all similar pairs of features between cameras $\tau$ and $s$ with respect to the Mahalanobis metric $M_{\tau s}$, and the second term is the squared Frobenius norm of the difference between $M_{\tau s}$ and the weighted combination of source metrics. $\lambda$ is a regularization parameter that balances the two terms. Note that the second term in Eq. 1 is essentially related to hypothesis transfer learning [4, 13], where the hypotheses are the source metrics. The first constraint requires the normalized sum of distances of all dissimilar pairs of features with respect to $M_{\tau s}$ to be at least a user-defined threshold $b$, and the second constrains the distance metric to always lie in the positive semi-definite cone. While the third constraint keeps all the elements of the source weight vector non-negative, the last constraint ensures that the weights do not deviate much from zero (by upper-bounding the $\ell_2$ norm by 1).
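To make the roles of the two terms and the four constraints concrete, the following is a minimal NumPy sketch that evaluates the objective of Eq. 1 and checks feasibility for a given pair $(M_{\tau s}, \beta)$. The function names, the array layout (rows of `X_sim` and `X_dis` hold the feature differences $x_{ij}$ of similar and dissimilar pairs, `sources` stacks the $N$ source metrics) and the tolerance `tol` are assumptions made for illustration; they are not part of the original method description.

```python
import numpy as np

def objective(M, beta, sources, X_sim, lam):
    """Objective of Eq. 1 (illustrative helper, hypothetical name)."""
    # Mean of x^T M x over similar pairs; each row of X_sim is one x_ij.
    sim_term = np.mean(np.einsum("nd,de,ne->n", X_sim, M, X_sim))
    # Weighted combination of source metrics: sum_j beta_j M_j.
    M_src = np.tensordot(beta, sources, axes=1)
    # Squared Frobenius norm of the difference (the biased regularizer).
    return sim_term + lam * np.sum((M - M_src) ** 2)

def is_feasible(M, beta, X_dis, b, tol=1e-8):
    """Check the four constraints of Eq. 1 up to a small tolerance."""
    dis_term = np.mean(np.einsum("nd,de,ne->n", X_dis, M, X_dis))
    far_enough = dis_term - b >= -tol                      # dissimilar pairs
    psd = np.linalg.eigvalsh((M + M.T) / 2).min() >= -tol  # M in the PSD cone
    nonneg = bool((beta >= -tol).all())                    # beta >= 0
    bounded = np.linalg.norm(beta) <= 1 + tol              # ||beta||_2 <= 1
    return bool(far_enough and psd and nonneg and bounded)
```

With $\beta = 0$ the regularizer reduces to $\lambda \|M_{\tau s}\|_F^2$, i.e., no source knowledge is used; a non-zero $\beta$ pulls $M_{\tau s}$ towards the weighted source combination instead.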
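Notation. We use the following notations in the optimization steps.
(a) $C_1 = \{M \in \mathbb{R}^{d \times d} \mid \frac{1}{n_d} \sum_{(i,j) \in D} (x_{ij}^\top M x_{ij}) - b \ge 0\}$
(b) $C_2 = \{M \in \mathbb{R}^{d \times d} \mid M \succeq 0\}$
(c) $C_3 = \{\beta \in \mathbb{R}^{N} \mid \beta \ge 0,\ \|\beta\|_2 \le 1\}$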
Optimization. The proposed optimization problem (1) is not jointly convex over $M_{\tau s}$ and $\beta$. To solve this non-convex optimization over large matrices, we devise an iterative algorithm that solves (1) efficiently by alternately solving two sub-problems. For the sake of brevity, we denote $M_{\tau s}$ as $M$ in the subsequent steps. Specifically, in the first step, we fix the weight $\beta$ and take a gradient step with respect to $M$ in the descent direction with step size $\alpha$ (Eq. 2). Then, we project the updated $M$ onto $C_1$ and $C_2$ in an alternating fashion until convergence (Eq. 3 and Eq. 4). In the next step, we fix the updated $M$ and take a step of size $\gamma$ in the direction of the negative gradient with respect to $\beta$ (Eq. 6). In the last step, we simply project $\beta$ onto the set $C_3$ (Eq. 7). Algorithm 1 summarizes the alternating minimization procedure to optimize (1). We briefly describe these steps below and refer the reader to the supplementary material for more mathematical details.
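Algorithm 1: Algorithm to Solve Eq. 1
Input: Source metrics $\{M_j\}_{j=1}^{N}$, labeled data $\{(x_{ij}, y_{ij})\}_{i=1}^{C}$
Output: Optimal metric $M^\star$
Initialization: $M^k$, $\beta^k$, $k = 0$;
while not converged do
    $M^{k+1} = M^k - \alpha \nabla_M f(M, \beta^k)|_{M = M^k}$ (Eq. 2);
    while not converged do
        $M^{k+1} = \Pi_{C_1}(M^{k+1})$ (Eq. 3);
        $M^{k+1} = \Pi_{C_2}(M^{k+1})$ (Eq. 4);
    end
    $\beta^{k+1} = \beta^k - \gamma \nabla_\beta f(M^{k+1}, \beta)|_{\beta = \beta^k}$ (Eq. 6);
    $\beta^{k+1} = \Pi_{C_3}(\beta^{k+1})$ (Eq. 7);
    $k = k + 1$;
end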
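The loop structure of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' implementation: the function name `solve_eq1`, the identity and uniform initializations, the default step sizes, and the use of fixed iteration counts (`n_outer`, `n_proj`) in place of convergence tests are all assumptions. The updates themselves follow Eqs. 2 to 7 as stated in the text.

```python
import numpy as np

def solve_eq1(sources, X_sim, X_dis, lam=1.0, b=1.0,
              alpha=0.01, gamma=0.01, n_outer=200, n_proj=20):
    """Alternating minimization for Eq. 1 (sketch with fixed iteration counts)."""
    N, d, _ = sources.shape
    sigma_s = X_sim.T @ X_sim / len(X_sim)   # Sigma_S of Eq. 2
    sigma_d = X_dis.T @ X_dis / len(X_dis)   # Sigma_D of Eq. 3
    # gram[i, j] = trace(M_i^T M_j), reused by every beta-gradient.
    gram = np.tensordot(sources, sources, axes=([1, 2], [1, 2]))
    M = np.eye(d)                # assumed initialization
    beta = np.full(N, 1.0 / N)   # assumed initialization, lies inside C3
    for _ in range(n_outer):
        # Gradient step on M with beta fixed (Eq. 2).
        M_src = np.tensordot(beta, sources, axes=1)
        M = M - alpha * (sigma_s + 2.0 * lam * (M - M_src))
        # Alternating projections onto C1 (Eq. 3) and C2 (Eq. 4).
        for _ in range(n_proj):
            slack = b - np.sum(sigma_d * M)
            M = M + max(0.0, slack) / np.sum(sigma_d ** 2) * sigma_d
            w, V = np.linalg.eigh((M + M.T) / 2.0)
            M = (V * np.maximum(w, 0.0)) @ V.T
        # Gradient step on beta with M fixed (Eqs. 5 and 6, vectorized).
        c = np.tensordot(sources, M, axes=([1, 2], [0, 1]))  # trace(M_i^T M)
        beta = beta - gamma * 2.0 * lam * (gram @ beta - c)
        # Projection onto C3 exactly as written in Eq. 7.
        beta = np.maximum(0.0, beta / max(1.0, np.linalg.norm(beta)))
    return M, beta
```

Precomputing the Gram matrix of the source metrics keeps each $\beta$ update at cost $O(N^2 + N d^2)$; in a practical implementation the fixed iteration counts would be replaced by tolerance-based stopping rules.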
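Step 1: Gradient w.r.t $M$ with fixed $\beta$.
With $k$ being the iteration number and $M^k$, $\beta^k$ being $M$ and $\beta$ in the $k$-th iteration, we compute the gradient of the objective function (1) with respect to $M$ by fixing $\beta = \beta^k$ at the $k$-th iteration as follows:
$$
\nabla_M f(M, \beta^k)\big|_{M = M^k} = \Sigma_S + 2\lambda \Big( M^k - \sum_{j=1}^{N} \beta_j^k M_j \Big), \tag{2}
$$
where $\Sigma_S = \frac{1}{n_s} \sum_{(i,j) \in S} x_{ij} x_{ij}^\top$ and
$$
f(M, \beta^k) = \frac{1}{n_s} \sum_{(i,j) \in S} x_{ij}^\top M x_{ij} + \lambda \Big\| M - \sum_{j=1}^{N} \beta_j^k M_j \Big\|_F^2 .
$$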
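The gradient in Eq. 2 is easy to verify numerically. The sketch below implements $f$ and its gradient with respect to $M$; the function names and the array layout (rows of `X_sim` are the differences $x_{ij}$ of similar pairs, `sources` stacks the source metrics) are assumptions for illustration only.

```python
import numpy as np

def f(M, beta, sources, X_sim, lam):
    """f(M, beta): mean similar-pair distance plus lam * ||M - sum_j beta_j M_j||_F^2."""
    M_src = np.tensordot(beta, sources, axes=1)
    sim = np.mean(np.einsum("nd,de,ne->n", X_sim, M, X_sim))
    return sim + lam * np.sum((M - M_src) ** 2)

def grad_M(M, beta, sources, X_sim, lam):
    """Gradient of f w.r.t. M at fixed beta, as in Eq. 2."""
    sigma_s = X_sim.T @ X_sim / len(X_sim)       # Sigma_S
    M_src = np.tensordot(beta, sources, axes=1)  # sum_j beta_j M_j
    return sigma_s + 2.0 * lam * (M - M_src)
```

Because $f$ is quadratic in $M$, a central finite difference along any direction $E$ should match $\langle \nabla_M f, E \rangle$ up to rounding error, which gives a quick sanity check for an implementation.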
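Step 2: Projection of $M$ onto $C_1$ and $C_2$. The projection of $M$ onto $C_1$ (denoted as $\Pi_{C_1}(M)$) can be computed by solving a constrained optimization as follows:
$$
\Pi_{C_1}(M) = \arg\min_{\tilde{M}} \ \frac{1}{2} \|\tilde{M} - M\|_F^2 \quad \text{subject to} \quad \frac{1}{n_d} \sum_{(i,j) \in D} \big( x_{ij}^\top \tilde{M} x_{ij} \big) - b \ge 0
$$
By writing the Lagrangian for the above constrained optimization and using the KKT conditions with strong duality, the projection of $M$ onto $C_1$ can be written as
$$
\Pi_{C_1}(M) = M + \max\left\{ 0, \ \frac{b - \frac{1}{n_d} \sum_{(i,j) \in D} x_{ij}^\top M x_{ij}}{\|\Sigma_D\|_F^2} \right\} \Sigma_D, \tag{3}
$$
where $\Sigma_D = \frac{1}{n_d} \sum_{(i,j) \in D} x_{ij} x_{ij}^\top$. Similarly, using the spectral (eigenvalue) decomposition, the projection of $M$ onto $C_2$ can be written as
$$
\Pi_{C_2}(M) = V \operatorname{diag}\big( [\hat{\lambda}_1 \ \hat{\lambda}_2 \ \ldots \ \hat{\lambda}_d] \big) V^\top, \tag{4}
$$
where $V$ is the eigenvector matrix of $M$, $\lambda_j$ is the $j$-th eigenvalue of $M$, and $\hat{\lambda}_j = \max\{\lambda_j, 0\}$ for all $j \in [1 \ldots d]$.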
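Both projections have closed forms that translate directly into a few lines of NumPy. In the sketch below, the function names and the convention that rows of `X_dis` hold the dissimilar-pair differences $x_{ij}$ are assumptions; the symmetrization before the eigendecomposition is an added numerical safeguard, not something stated in the text.

```python
import numpy as np

def project_c1(M, X_dis, b):
    """Projection onto C1 (Eq. 3): a half-space, since the constraint is <Sigma_D, M> >= b."""
    sigma_d = X_dis.T @ X_dis / len(X_dis)
    slack = b - np.sum(sigma_d * M)   # b - (1/n_d) sum x^T M x
    return M + max(0.0, slack) / np.sum(sigma_d ** 2) * sigma_d

def project_c2(M):
    """Projection onto the PSD cone (Eq. 4): clip negative eigenvalues to zero."""
    w, V = np.linalg.eigh((M + M.T) / 2.0)   # symmetrize for numerical safety
    return (V * np.maximum(w, 0.0)) @ V.T    # V diag(max(w, 0)) V^T
```

As a small worked case, take $M = 0$, $b = 1$ and two dissimilar pairs with $x_{ij}$ equal to the two standard basis vectors of $\mathbb{R}^2$: then $\Sigma_D = I/2$, $\|\Sigma_D\|_F^2 = 1/2$, the slack is 1, and Eq. 3 returns $M = I$, which satisfies the constraint with equality.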
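Step 3: Gradient w.r.t $\beta$ with fixed $M$. By fixing $M = M^{k+1}$ in the objective function and differentiating it w.r.t $\beta_i$, the $i$-th element of $\beta$, at the point $\beta = \beta^k$, we get
$$
\nabla_{\beta_i} f(M^{k+1}, \beta)\big|_{\beta_i = \beta_i^k} = 2\lambda \beta_i^k \operatorname{trace}(M_i^\top M_i) - 2\lambda \operatorname{trace}\Big( M_i^\top \Big( M^{k+1} - \sum_{j=1, j \neq i}^{N} \beta_j^k M_j \Big) \Big) \tag{5}
$$
By denoting $\nabla_{\beta_i} f(M^{k+1}, \beta)\big|_{\beta_i = \beta_i^k}$ as $a_i^k$, we get
$$
\nabla_\beta f(M^{k+1}, \beta)\big|_{\beta = \beta^k} = [a_1^k \ a_2^k \ \ldots \ a_N^k]^\top \tag{6}
$$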
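Eq. 5 can be evaluated entry by entry, or for all $i$ at once by noting that $a_i^k = 2\lambda \big( \sum_j \beta_j^k \operatorname{trace}(M_i^\top M_j) - \operatorname{trace}(M_i^\top M^{k+1}) \big)$. The sketch below shows both forms; the function names and the stacked `sources` array are assumptions for illustration.

```python
import numpy as np

def grad_beta_eq5(M, beta, sources, lam):
    """Entry-wise gradient, a literal transcription of Eq. 5."""
    N = len(sources)
    a = np.zeros(N)
    for i in range(N):
        # Sum over all source metrics except the i-th one.
        rest = sum(beta[j] * sources[j] for j in range(N) if j != i)
        a[i] = (2.0 * lam * beta[i] * np.trace(sources[i].T @ sources[i])
                - 2.0 * lam * np.trace(sources[i].T @ (M - rest)))
    return a

def grad_beta(M, beta, sources, lam):
    """Vectorized form of Eq. 6 using the Gram matrix of the source metrics."""
    gram = np.tensordot(sources, sources, axes=([1, 2], [1, 2]))  # trace(M_i^T M_j)
    c = np.tensordot(sources, M, axes=([1, 2], [0, 1]))           # trace(M_i^T M)
    return 2.0 * lam * (gram @ beta - c)
```

The two functions return the same vector; the vectorized version avoids rebuilding the partial sums for every index $i$.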
Step 4: Projection of $\beta$ onto $C_3$. This step essentially projects a vector onto the first quadrant of an $N$-dimensional unit-norm hyper-sphere. The closed-form expression of the projection onto $C_3$ is as follows:
$$
\Pi_{C_3}(\beta^{k+1}) = \max\left\{ 0, \ \frac{\beta^{k+1}}{\max\{1, \|\beta^{k+1}\|_2\}} \right\} \tag{7}
$$
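Eq. 7 is a one-line operation in NumPy. The sketch below transcribes the expression as written, with the outer maximum applied element-wise; the function name is an assumption.

```python
import numpy as np

def project_c3(beta):
    """Eq. 7: rescale beta if its l2 norm exceeds 1, then clip negatives to zero."""
    scaled = beta / max(1.0, np.linalg.norm(beta))
    return np.maximum(0.0, scaled)
```

For example, $\beta = (3, -4)$ has norm 5, so it is first scaled to $(0.6, -0.8)$ and then clipped to $(0.6, 0)$; a vector such as $(0.3, 0.4)$, which already lies in $C_3$, is returned unchanged.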
4. Discussion and Analysis
One of the key differences between our approach and
existing methods is that the nature of our problem deals
with the multiple metric setting within the hypothesis trans-
fer learning framework. In this section, following [27], we
theoretically analyze the properties of our Algorithm 1 for
transferring knowledge from multiple metrics.
Let $\mathcal{T}$ be a domain defined over the set $(\mathcal{X} \times \mathcal{Y})$, where $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} = \{-1, 1\}$ denote the feature and label set, respectively, with a probability distribution denoted by $D_{\mathcal{T}}$. Let $T$ be the target domain defined by $\{(x_i, y_i)\}_{i=1}^{n}$, consisting of $n$ i.i.d. samples, each drawn from the distribution $D_{\mathcal{T}}$. The optimization proposed in Eq. 1 of [27] (page 2) is defined as:
$$
\underset{M \succeq 0}{\text{minimize}} \quad L_T(M) + \lambda \|M - M_S\|_F^2 \tag{8}
$$
Fixing the value of $\beta$ in our proposed optimization (1), we have an optimization problem equivalent to (8), where $M_S = \sum_{j=1}^{N} \beta_j M_j$ and
$$
L_T(M) = \frac{1}{n_s} \sum_{(i,j) \in S} x_{ij}^\top M x_{ij} + \mu^\star \Big( b - \frac{1}{n_d} \sum_{(i,j) \in D} x_{ij}^\top M x_{ij} \Big) \tag{9}
$$
Note that $\mu^\star$ in Eq. 9 is the optimal dual variable for the inequality constraint of optimization (1) with the weight vector fixed. Clearly, the expression is linear, hence convex in $M$, and has a finite Lipschitz constant $k$.
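Theorem 1. For the convex and $k$-Lipschitz loss (shown in the supplementary material) defined in (9), the average bound can be expressed as
$$
\mathbb{E}_{T \sim D_{\mathcal{T}}^{n}} \big[ L_{D_{\mathcal{T}}}(M^\star) \big] \le L_{D_{\mathcal{T}}}(M_S) + \frac{8k^2}{\lambda n}, \tag{10}
$$
where $n$ is the number of target labeled examples, $M^\star$ is the optimal metric computed from Algorithm 1, $M_S$ is the average of all source metrics, defined as $\frac{1}{N} \sum_{j=1}^{N} M_j$, $\mathbb{E}_{T \sim D_{\mathcal{T}}^{n}} [L_{D_{\mathcal{T}}}(M^\star)]$ is the expected loss of $M^\star$ computed over the distribution $D_{\mathcal{T}}$, and $L_{D_{\mathcal{T}}}(M_S)$ is the loss of the average of source metrics computed over $D_{\mathcal{T}}$.
Proof. The proof is given in the supplementary material.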
Implication of Theorem 1: Since we transfer knowledge from multiple source metrics, and do not know which is the most generalizable over the target distribution (i.e., the best source metric), the most sensible thing is to check the average performance of using each of the source metrics directly over the target test data. This is equivalent to giving all the source metrics equal weights and not using any of the target data for training. The bound in Theorem 1 (Eq. 10) shows that, on average, the metric learned from Algorithm 1 tends to do better than, or in the worst case at least as well as, the average of source metrics, with a fast convergence rate of $O(\frac{1}{n})$ with a limited number of target samples [27].
Theorem 2. With probability $(1 - \delta)$, for any metric $M$ learned from Algorithm 1 we have
$$
L_{D_{\mathcal{T}}}(M) \le L_T(M) + O\Big( \frac{1}{n} \Big) + \left( \sqrt{\frac{L_T\big( \sum_{j=1}^{N} \beta_j M_j \big)}{\lambda}} + \Big\| \sum_{j=1}^{N} \beta_j M_j \Big\|_F \right) \sqrt{\frac{\ln(\frac{2}{\delta})}{2n}}, \tag{11}
$$
where $L_{D_{\mathcal{T}}}(M)$ is the loss over the original target distribution (true risk), $L_T(M)$ is the loss over the existing target data (empirical risk), and $n$ is the number of target samples.
Proof. See the supplementary material for the proof.
Implication of Theorem 2: This bound shows that given only a small amount of labeled target data, our method performs close to the fully supervised case. The right-hand side of inequality (11) contains the term $O(\frac{1}{n}) + \Phi(\beta)\, O(\frac{1}{\sqrt{n}})$. Since the optimal weight $\beta^\star$ from optimization (1) will be sparse due to the way $\beta$ is constrained, zero weights will automatically be assigned to the outlier metrics, i.e., outlier $M_j$s, resulting in zero values for the terms $\beta_j^\star L_T(M_j)$ corresponding to those indices $j$ and hence a smaller value of $\Phi(\beta)$. As a result, the $O(\frac{1}{\sqrt{n}})$ term will be less dominant in (11) than $O(\frac{1}{n})$, due to the smaller associated coefficient $\Phi(\beta^\star)$ and, hence, can be ignored. Thus, due to the faster decay rate of $O(\frac{1}{n})$, this implies that with very limited target data, the empirical risk will converge to the true risk. Furthermore, when $n$ is very large (the fully supervised case), $O(\frac{1}{\sqrt{n}})$ will be close to zero and cannot be altered by multiplication with any coefficient. This implies that the source metrics will not have any effect on learning when there is enough labeled target data available and are only useful in the presence of limited data, as in our application domain.
Negative Transfer: In optimization (1), we jointly estimate
the optimal metric, as well as the weight vector, which de-
termines which source to transfer from and with how much
weight. If a source metric is not a good representative of the
target distribution, for an optimal λ, the weight associated
to this metric will automatically be set to zero or close to
zero by optimization (1), due to the sparsity constraint of β.
Hence, our approach minimizes the risk of negative transfer.
5. Experiments
Datasets. We test the effectiveness of our method by experimenting on four publicly available person re-id datasets, namely WARD [20], RAiD [2], Market1501 [50], and
[Figure 2 plots: CMC curves, recognition rate [%] versus rank (up to rank 20), averaged across camera pairs with all cameras as target. Legend nAUC values:
(a) WARD: Ours 94.31, Adapt-GFK 90.32, Avg-Source 87.26, CAMEL 75.84, Best-GFK 82.39, Direct-GFK 81.73.
(b) RAiD: Ours 88.67, Adapt-GFK 87.36, Avg-Source 81.83, CAMEL 71.43, Best-GFK 76.07, Direct-GFK 79.07.
(c) Market1501: Ours 93, Adapt-GFK 88, Avg-Source 90, CAMEL 86, Best-GFK 78, Direct-GFK 81.
(d) MSMT17: Ours 58, Adapt-GFK 52, Avg-Source 51, CAMEL 50, Best-GFK 46, Direct-GFK 49.]
Figure 2: CMC curves averaged over all target camera combinations, introduced one at a time. (a) WARD with 3 cameras,
(b) RAiD with 4 cameras, (c) Market1501 with 6 cameras and (d) MSMT17 with 15 cameras. Best viewed in color.
MSMT17 [36]. There are several other re-id datasets like
VIPeR [8], PRID2011 [11] and CUHK01 [14]; however,
those do not apply in our case due to availability of only
two cameras. RAiD and WARD are smaller datasets with
43 and 70 persons captured in 4 and 3 cameras, respec-
tively, whereas Market1501 and MSMT17 are more recent
and large datasets with 1,501 and 4,101 persons captured
across 6 and 15 cameras, respectively.
Feature Extraction and Matching. We use the Local Maximal Occurrence (LOMO) feature [16] of length 29,960 for the RAiD and WARD datasets. However, since LOMO usually performs poorly on large datasets [7], for Market1501 and MSMT17 we extract features from the last layer of an ImageNet [3] pre-trained ResNet50 network [10] (denoted as IDE features in our work). We apply standard PCA to reduce the feature dimension to 100, as in [12, 25].
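The PCA reduction step can be sketched with a plain SVD. This is a generic PCA sketch, not the exact preprocessing code used in the experiments; the function names are assumptions, and the projection is assumed to be fitted on training features only and then applied to test features.

```python
import numpy as np

def pca_fit(X, n_components=100):
    """Fit PCA on the rows of X; returns the mean and a (d, n_components) projection."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T

def pca_transform(X, mean, W):
    """Project features onto the fitted principal directions."""
    return (X - mean) @ W
```

Note that the SVD yields at most min(number of samples, feature dimension) directions, so with very few training images the reduced dimension can be smaller than the requested `n_components`.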
Performance Measures. We provide standard Cumulative
Matching Curves (CMC) and normalized Area Under Curve
(nAUC), as is common in person re-id [2, 12, 16, 26]. While
the former shows accumulated accuracy by considering the
k-most similar matches within a ranked list, the latter is a
measure of re-id accuracy, independent of the number of
test samples. Due to the space constraint, we only report
average CMC curves for most experiments and leave the
full CMC curves in the supplementary material.
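For reference, a CMC curve and an nAUC score can be computed from a query-by-gallery distance matrix as sketched below. The function names are assumptions, every query identity is assumed to appear in the gallery, and nAUC is taken here as the mean of the CMC values over the evaluated ranks, which is one common convention and may differ in detail from the evaluation code used for the reported numbers.

```python
import numpy as np

def cmc_curve(dist, query_ids, gallery_ids, max_rank=20):
    """Recognition rate [%] at ranks 1..max_rank from dist[q, g] (smaller = closer)."""
    order = np.argsort(dist, axis=1)                       # gallery sorted per query
    matches = gallery_ids[order] == query_ids[:, None]     # True where identity matches
    first_hit = matches.argmax(axis=1)                     # 0-based rank of first match
    ranks = np.arange(max_rank)
    # A query counts as recognized at rank r if its first match is within the top r.
    return 100.0 * (first_hit[:, None] <= ranks[None, :]).mean(axis=0)

def nauc(cmc):
    """Normalized area under the CMC curve, taken as the mean over ranks (assumption)."""
    return float(np.mean(cmc))
```

The Rank-1 accuracy quoted in the experiments corresponds to the first entry of the returned CMC array.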
Experimental Settings. For RAiD we follow the protocol
in [16] and randomly split the persons into a training set of
21 persons and a test set of 20 persons, whereas for WARD,
we randomly split the 70 persons into a set of 35 persons for
training and the remaining 35 persons for testing. For both datasets,
we perform 10 train/test splits and average accuracy across
all splits. We use the standard training and testing splits for
both Market1501 and MSMT17 datasets. During testing,
we follow a multi-query approach by averaging all query
features of each id in the target camera and compare with
all features in the source camera [50].
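The multi-query protocol amounts to pooling the target-camera features per identity before matching. A minimal sketch, assuming features are stored row-wise with a parallel array of identity labels (the function names are hypothetical):

```python
import numpy as np

def multi_query_features(feats, ids):
    """Average all query features that share an identity (multi-query pooling)."""
    uniq = np.unique(ids)
    pooled = np.stack([feats[ids == u].mean(axis=0) for u in uniq])
    return uniq, pooled

def pairwise_distance(Q, G, M):
    """Distance sqrt((q - g)^T M (q - g)) between every query q and gallery g."""
    diff = Q[:, None, :] - G[None, :, :]
    sq = np.einsum("qgd,de,qge->qg", diff, M, diff)
    return np.sqrt(np.maximum(sq, 0.0))   # guard against tiny negative rounding
```

The pooled queries are then compared with all source-camera features under the learned metric, and the resulting distance matrix is ranked per query.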
Compared Methods. We compare our approach with the
following methods. (1) Two variants of Geodesic Flow Kernel (GFK) [6]: Direct-GFK, where the kernel between a source-target camera pair is directly used to evaluate the accuracy, and Best-GFK, where GFK between the
best source camera and the target camera is used to eval-
uate accuracy between all source-target camera pairs as in
[25, 26]. Both methods use the supervised dimensionality
reduction method, Partial Least Squares (PLS), to project
features into a low dimensional subspace [25, 26]. (2) State-
of-the-art method for on-boarding new cameras [25, 26] that
uses transitive inference over the learned GFK across the
best source and target camera (Adapt-GFK). (3) Clustering-
based Asymmetric MEtric Learning (CAMEL) method of
[47], which projects features from source and target cam-
era to a shared space using a learned projection matrix. For
all compared methods, we use their publicly available code
and perform evaluation in our setting.
5.1. Onboarding a Single New Camera
We consider one camera as newly introduced target cam-
era and all the other as source cameras. We consider all the
possible combinations for conducting experiments. In addi-
tion to the baselines described above, we compare against
the accuracy of the average of the source metrics (Avg-Source), applied directly to the target test set, to empirically validate Theorem 1. We also compute the GFK kernels in two settings: by considering only the target data available after introducing the new cameras (Figure 2), and by considering the presence of both old source data and the new labeled data after camera installation, as in [25, 26] (Figure 3).
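The Avg-Source baseline requires no training: the source metrics are averaged with equal weights and the result is used directly for matching. A minimal sketch, with hypothetical function names and row-wise feature arrays assumed:

```python
import numpy as np

def avg_source_metric(sources):
    """Equal-weight average of the N source metrics (the Avg-Source baseline)."""
    return sources.mean(axis=0)

def rank1_accuracy(M, Q, q_ids, G, g_ids):
    """Rank-1 accuracy [%] of metric M for queries Q against gallery G."""
    diff = Q[:, None, :] - G[None, :, :]
    d2 = np.einsum("qgd,de,qge->qg", diff, M, diff)   # squared distances
    return 100.0 * float(np.mean(g_ids[d2.argmin(axis=1)] == q_ids))
```

Comparing this number with the Rank-1 accuracy of the metric learned by Algorithm 1 is the empirical counterpart of the comparison made in Theorem 1.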
Implementation details. We split training data into disjoint
source and target data considering the fact that the persons
that appear in the new camera after installation may or may
not be seen before in the source cameras. That is, for Mar-
ket1501 and MSMT17, we split the training data into 90%
of persons that are only seen by the source cameras and
10% that are seen in both source cameras and the new target camera after the installation. Since there are far fewer persons in the RAiD and WARD training sets, we split the persons into 80% source and 20% target for those two datasets.
For each dataset, we evaluate every source-target pair and
average accuracy across all pairs. Furthermore, we average
accuracy across all cameras as target. Note that the train
[Figure 3 plot: CMC curves on the WARD dataset, recognition rate [%] versus rank (up to rank 20), averaged across camera pairs with all cameras as target. Legend nAUC values: Ours 94.31, Adapt-GFK 89.03, Best-GFK 80.01, Direct-GFK 78.59.]
Figure 3: CMC curves averaged over all the target cam-
era combinations, introduced one at a time, on the WARD
dataset. Note that both old and new source data are used for
calculation of GFK. Best viewed in color.
and test set are kept disjoint in all our experiments.
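The disjoint source/target split of training identities can be sketched as follows. The function name, the random seed and the rounding rule are assumptions; the fraction is 0.1 for Market1501 and MSMT17 and 0.2 for RAiD and WARD, as stated above.

```python
import numpy as np

def split_identities(person_ids, target_frac=0.1, seed=0):
    """Split unique training identities into disjoint source and target sets."""
    rng = np.random.default_rng(seed)
    ids = rng.permutation(np.unique(person_ids))
    # At least one identity is kept for the target set.
    n_target = max(1, int(round(target_frac * len(ids))))
    target_ids = ids[:n_target]   # seen by source cameras and the new camera
    source_ids = ids[n_target:]   # seen only by the source cameras
    return source_ids, target_ids
```

Only the images of `target_ids` that involve the new camera would be used as labeled target data; the images of `source_ids` serve solely to train the source metrics and are then discarded.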
Results. Figures 2 and 3 show the results. In all cases, our method outperforms all the compared methods. The most competitive methods are Adapt-GFK and Avg-Source, which also use source metrics. For the remaining methods, we see the limitation of using only limited target data to compute the new metrics. For Market1501, we see that Avg-Source outperforms the Adapt-GFK baseline, indicating the advantage of knowledge transfer from multiple source metrics compared to one single best source metric as in [25, 26]. However, our approach still outperforms
the Avg-Source baseline by a margin of 20.60%, 13.81%,
2.01% and 1.07% in Rank-1 accuracy on RAiD, WARD,
Market1501 and MSMT17, respectively, validating our im-
plications of Theorem 1. Furthermore, we observe that even
without accessing the source training data that was used
for training the network before adding a new camera, our
method outperforms the GFK based methods that use all the
source data in their computations (see Figure 3). To sum-
marize, the experimental results show that our method per-
forms better on both small and large camera networks with
limited supervision, as it is able to adapt multiple source
metrics through reducing negative transfer by dynamically
weighting the source metrics.
5.2. Onboarding Multiple New Cameras
We perform this experiment on the Market1501 dataset using the same strategy as in Section 5.1 and compare our results with other methods while adding multiple target cameras to the network, either sequentially or in parallel.
Parallel On-boarding of Cameras: We randomly se-
lect two or three cameras as target while keeping the re-
maining as source. All the new target cameras are tested
against both source cameras and other target cameras. The
results of adding two and three cameras in parallel (at the
same time) are shown in Figure 4 (a) and (b), respectively.
In both cases, our method outperforms all the compared
methods with an increasing margin as rank increases. We
outperform the most competitive method, CAMEL, in Rank-1
accuracy by 5.45% and 3.73% while adding two and three
cameras, respectively. Furthermore, our method better adapts
source metrics since it has the capability of assigning zero
weights to the metrics that do not generalize well over target
data. Meanwhile, Adapt-GFK has a high probability of using
outlier source metrics when fewer source metrics are
available, which causes negative transfer. This
is shown in Figure 4, where GFK-based methods
perform worse than CAMEL, which is computed with
limited supervision only, without using any source metrics.
Sequential On-boarding of Cameras: For this exper-
iment, we randomly select three target cameras that are
added sequentially. A target camera is tested against all
source cameras and previously added target cameras. The
results are shown in Figure 4 (c). Similar to parallel on-boarding,
our method outperforms the compared methods by
a large margin. In this setting, we outperform CAMEL
by 8.22% in Rank-1 accuracy. Additionally, compared to
all GFK-based methods, the Rank-1 margin is kept con-
stant at 10% for both parallel and sequential on-boarding.
These results show the scalability of our proposed method
while adding multiple cameras to a network, irrespective of
whether they are added in parallel or sequentially.
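
For readers unfamiliar with the evaluation protocol, the sketch below shows one common way to compute a CMC curve and Rank-1 accuracy from a probe-by-gallery distance matrix. The function names and the toy distance matrix are hypothetical, and the nAUC is taken here as the mean of the CMC values over the evaluated ranks, which is an assumption about the normalization and not a description of our evaluation code.

```python
import numpy as np

def cmc_curve(dist, probe_ids, gallery_ids, max_rank=20):
    """CMC curve in percent: entry k-1 is the share of probes whose
    correct gallery match appears within the top k ranked candidates."""
    dist = np.asarray(dist, dtype=float)
    probe_ids = np.asarray(probe_ids)
    gallery_ids = np.asarray(gallery_ids)
    max_rank = min(max_rank, dist.shape[1])

    # Sort gallery candidates by increasing distance for every probe.
    order = np.argsort(dist, axis=1)
    matches = gallery_ids[order] == probe_ids[:, None]

    has_match = matches.any(axis=1)
    first_hit = matches.argmax(axis=1)  # 0-based rank of first correct match

    hits = np.zeros(max_rank)
    for rank, found in zip(first_hit, has_match):
        if found and rank < max_rank:
            hits[rank:] += 1  # a hit at rank r counts for all ranks >= r
    return 100.0 * hits / len(probe_ids)

def normalized_auc(cmc):
    """Assumed normalization: mean CMC value over the evaluated ranks."""
    return float(np.mean(cmc))

# Toy example with three probes and three gallery identities.
toy_dist = np.array([[0.1, 0.5, 0.9],   # correct match ranked 1st
                     [0.2, 0.4, 0.3],   # correct match ranked 3rd
                     [0.7, 0.2, 0.6]])  # correct match ranked 2nd
cmc = cmc_curve(toy_dist, probe_ids=[0, 1, 2], gallery_ids=[0, 1, 2])
rank1 = cmc[0]
nauc = normalized_auc(cmc)
```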
5.3. Different Labeled Data in New Cameras
We perform this experiment to show the implications of
Theorem 2 by using different percentages of labeled tar-
get data (10%, 20%, 30%, 50%, 75% and 100%) in our
method. We compare with the widely used KISS metric learning
(KISSME) [12] algorithm and show the difference in
Rank-1 accuracy as a function of labeled target data. Fig-
ure 5 (a) shows the results. At only 10% labeled data, the
difference between our method and KISSME [12] is almost
30%; however, as we add more labeled data, the Rank-1
accuracy becomes equivalent for the two methods at 100%
labeled data. This confirms the implications of Theorem 2,
where we showed that with increasing labeled target data,
the effect of source metrics in learning becomes negligible.
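
For reference, the KISSME baseline [12] estimates a metric from the covariances of pairwise feature differences, M = Σ_S^{-1} − Σ_D^{-1}, followed by a projection onto the cone of positive semi-definite matrices. The sketch below is a minimal version of that estimator on synthetic data; the function name, the synthetic features and the noise levels are assumptions made for illustration, and it uses target pairs only, with no source metrics.

```python
import numpy as np

def kissme(X_a, X_b, same):
    """KISSME-style metric from paired features.

    X_a, X_b: (n, d) arrays holding the two features of each pair.
    same: (n,) boolean array, True when the pair shows the same person.
    """
    diff = np.asarray(X_a, dtype=float) - np.asarray(X_b, dtype=float)
    same = np.asarray(same, dtype=bool)

    # Covariances of pairwise differences for similar and dissimilar pairs.
    cov_s = diff[same].T @ diff[same] / same.sum()
    cov_d = diff[~same].T @ diff[~same] / (~same).sum()

    M = np.linalg.inv(cov_s) - np.linalg.inv(cov_d)

    # Project onto the PSD cone by clipping negative eigenvalues.
    eigvals, eigvecs = np.linalg.eigh((M + M.T) / 2)
    return (eigvecs * np.clip(eigvals, 0.0, None)) @ eigvecs.T

# Synthetic pairs (assumed data): similar pairs differ by small noise,
# dissimilar pairs are independent draws.
rng = np.random.default_rng(0)
n, d = 200, 4
base = rng.normal(size=(n, d))
X_a = np.vstack([base, base])
X_b = np.vstack([base + 0.1 * rng.normal(size=(n, d)),
                 rng.normal(size=(n, d))])
same = np.array([True] * n + [False] * n)

M = kissme(X_a, X_b, same)
pair_diff = X_a - X_b
pair_dist = np.einsum("ij,jk,ik->i", pair_diff, M, pair_diff)
```

With few labeled pairs the two covariance estimates become unreliable, which is the regime where a prior built from source metrics is expected to help most.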
5.4. Finetuning with Deep Features
This section shows the strength of our method while
comparing with CNN features extracted from a network
trained on the source data (we train a ResNet50 model [10],
pretrained on the ImageNet dataset). Without transfer learning,
we have two options: (a) directly use the source model
to extract features in the target and do matching based on
Euclidean/KISSME metric (IDE), (b) finetune the source
model using limited target data and then extract features
to do matching using Euclidean/KISSME (finetuned). We
[Figure 4 plots: CMC curves on the Market1501 dataset, averaged across camera pairs. x-axis: Rank (2 to 20); y-axis: Recognition rate [%] (0 to 70).
(a) Cameras 4 and 5 are target. nAUC: Ours 93.17, Adapt-GFK 83.13, CAMEL 88.66, Best-GFK 81.41, Direct-GFK 81.53.
(b) Cameras 1, 3 and 4 are target. nAUC: Ours 93.20, Adapt-GFK 82.76, CAMEL 88.86, Best-GFK 79.23, Direct-GFK 80.62.
(c) Cameras 1, 2 and 6 are target. nAUC: Ours 92.76, Adapt-GFK 85.63, CAMEL 84.83, Best-GFK 78.58, Direct-GFK 81.33.]
Figure 4: CMC curves averaged across target cameras on the Market1501 dataset. (a) and (b) show results while adding two and
three cameras in parallel; (c) shows results while adding three cameras sequentially, one after another. Best viewed in color.
[Figure 5 plots.
(a) x-axis: Percentage labeling [%] (10, 20, 30, 50, 75, 100); y-axis: Rank-1 accuracy [%] (0 to 70); curves: Ours, KISSME, Difference.
(b) x-axis: Percentage labeling [%] (2, 4, 8, 10, 20); y-axis: Rank-1 accuracy [%] (0 to 50); curves: Ours (IDE), KISSME (IDE), Euclidean (IDE), Ours (finetuned), KISSME (finetuned), Euclidean (finetuned).
(c) x-axis ticks: 10^-5 to 10^-1; y-axis: Rank-1 accuracy [%] (16 to 34); curves: 10% labeling, 2% labeling.]
Figure 5: (a) Effect of different percentages of labeled target data on the WARD dataset, justifying Theorem 2; (b) analysis of our
method with deep features trained on source camera data in the Market1501 dataset with the 6th camera as target; (c) sensitivity of
the Rank-1 performance to λ, tested using deep features in Market1501 with the 6th camera as target. Best viewed in color.
compare these baselines with our method for different
percentages of labeling on the Market1501 dataset, where the
pairwise metrics are computed using the source features
extracted from the model without any finetuning. We use
those source metrics along with the target features
extracted before (Ours(IDE)) and after finetuning the source
model (Ours(finetuned)). Please see the supplementary material
for more details. Figure 5 (b) shows the results.
Ours(IDE) outperforms Euclidean(IDE) by a margin of
10% on Market1501 with 20% of labeled target data. The difference
between Ours(finetuned) and Euclidean/KISSME
(finetuned) is more noticeable with less labeled data and
becomes smaller as the labeled target data increases (Theorem 2).
However, Ours(finetuned) consistently outperforms
all the other baselines for up to 20% labeling.
5.5. Parameter Sensitivity
We perform this experiment to study the effect of λ in
optimization (1) for a given percentage of labeled target data.
Figure 5 (c) shows the Rank-1 accuracy of our proposed
method for different values of λ. From optimization (1),
when λ → ∞, the left term can be neglected, resulting in
the optimal M and β being zero. However, when λ → 0, the
regularization term is neglected, resulting in no transfer. We
can see from Figure 5 (c) that there is an operating zone of
λ (e.g., in the range of 10^-4 to 10^-2) that is neither too
high nor too low for useful transfer from source metrics.
6. Conclusions
We addressed a critically important problem in person
re-identification that has received little attention thus far:
how to quickly on-board new cameras into an existing camera
network. We showed that this can be addressed effectively
with hypothesis transfer learning, using only learned source
metrics and a limited amount of labeled data collected after
installing the new camera(s). We provided theoretical analysis
to show that our approach minimizes the effect of negative
transfer by finding an optimal weighted combination
of multiple source metrics. We showed the effectiveness
of our approach on four standard datasets, significantly
outperforming several baseline methods.
Acknowledgements. This work was partially supported
by ONR grant N00014-19-1-2264 and NSF grant 1724341.
References
[1] Ejaz Ahmed, Michael Jones, and Tim K Marks. An improved deep learning architecture for person re-identification. In CVPR, pages 3908–3916, 2015.
[2] Abir Das, Anirban Chakraborty, and Amit K Roy-Chowdhury. Consistent re-identification in a camera network. In ECCV, pages 330–345. Springer, 2014.
[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.
[4] Simon S Du, Jayanth Koushik, Aarti Singh, and Barnabas Poczos. Hypothesis transfer learning via transformation functions. In Advances in Neural Information Processing Systems, pages 574–584, 2017.
[5] Hehe Fan, Liang Zheng, Chenggang Yan, and Yi Yang. Unsupervised person re-identification: Clustering and fine-tuning. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 14(4):83, 2018.
[6] Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, pages 2066–2073. IEEE, 2012.
[7] Mengran Gou, Ziyan Wu, Angels Rates-Borras, Octavia Camps, Richard J Radke, et al. A systematic evaluation and benchmark for person re-identification: Features, metrics, and datasets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(3):523–536, 2018.
[8] Douglas Gray and Hai Tao. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In ECCV, pages 262–275. Springer, 2008.
[9] Matthieu Guillaumin, Jakob Verbeek, and Cordelia Schmid. Is that you? metric learning approaches for face identification. In CVPR, pages 498–505. IEEE, 2009.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[11] Martin Hirzer, Csaba Beleznai, Peter M Roth, and Horst Bischof. Person re-identification by descriptive and discriminative classification. In Scandinavian conference on Image analysis, pages 91–102. Springer, 2011.
[12] Martin Koestinger, Martin Hirzer, Paul Wohlhart, Peter M Roth, and Horst Bischof. Large scale metric learning from equivalence constraints. In CVPR, pages 2288–2295. IEEE, 2012.
[13] Ilja Kuzborskij and Francesco Orabona. Stability and hypothesis transfer learning. In ICML, pages 942–950, 2013.
[14] Wei Li, Rui Zhao, and Xiaogang Wang. Human reidentification with transferred metric learning. In ACCV, pages 31–44. Springer, 2012.
[15] Wei Li, Xiatian Zhu, and Shaogang Gong. Harmonious attention network for person re-identification. In CVPR, pages 2285–2294, 2018.
[16] Shengcai Liao, Yang Hu, Xiangyu Zhu, and Stan Z. Li. Person re-identification by local maximal occurrence representation and metric learning. In CVPR, pages 2197–2206, June 2015.
[17] Yutian Lin, Xuanyi Dong, Liang Zheng, Yan Yan, and Yi Yang. A bottom-up clustering approach to unsupervised person re-identification. In AAAI, volume 33, pages 8738–8745, 2019.
[18] Jianming Lv, Weihang Chen, Qing Li, and Can Yang. Unsupervised cross-dataset person re-identification by transfer learning of spatial-temporal patterns. In CVPR, pages 7948–7956, 2018.
[19] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation with multiple sources. In Advances in Neural Information Processing Systems, pages 1041–1048, 2009.
[20] Niki Martinel and Christian Micheloni. Re-identify people in wide area camera network. In 2012 IEEE computer society conference on computer vision and pattern recognition workshops, pages 31–36. IEEE, 2012.
[21] Tetsu Matsukawa, Takahiro Okabe, Einoshin Suzuki, and Yoichi Sato. Hierarchical gaussian descriptor for person re-identification. In CVPR, pages 1363–1372, 2016.
[22] Tetsu Matsukawa, Takahiro Okabe, Einoshin Suzuki, and Yoichi Sato. Hierarchical gaussian descriptors with application to person re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[23] Hyeonwoo Noh, Taehoon Kim, Jonghwan Mun, and Bohyung Han. Transfer learning via unsupervised task discovery for visual question answering. In CVPR, pages 8385–8394, 2019.
[24] Francesco Orabona, Claudio Castellini, Barbara Caputo, Angelo Emanuele Fiorilla, and Giulio Sandini. Model adaptation with least-squares svm for adaptive hand prosthetics. In 2009 IEEE International Conference on Robotics and Automation, pages 2897–2903. IEEE, 2009.
[25] Rameswar Panda, Amran Bhuiyan, Vittorio Murino, and Amit K Roy-Chowdhury. Unsupervised adaptive re-identification in open world dynamic camera networks. In CVPR, pages 7054–7063, 2017.
[26] Rameswar Panda, Amran Bhuiyan, Vittorio Murino, and Amit K Roy-Chowdhury. Adaptation of person re-identification models for on-boarding new camera(s). Pattern Recognition, 96:106991, 2019.
[27] Michael Perrot and Amaury Habrard. A theoretical analysis of metric hypothesis transfer learning. In ICML, pages 1708–1717, 2015.
[28] Xuelin Qian, Yanwei Fu, Yu-Gang Jiang, Tao Xiang, and Xiangyang Xue. Multi-scale deep learning architectures for person re-identification. In ICCV, pages 5399–5408, 2017.
[29] Amit K Roy-Chowdhury and Bi Song. Camera networks: The acquisition and analysis of videos over wide areas. Synthesis Lectures on Computer Vision, 3(1):1–133, 2012.
[30] Ruoqi Sun, Xinge Zhu, Chongruo Wu, Chen Huang, Jianping Shi, and Lizhuang Ma. Not all areas are equal: Transfer learning for semantic segmentation via hierarchical region selection. In CVPR, pages 4360–4369, 2019.
[31] Yifan Sun, Qin Xu, Yali Li, Chi Zhang, Yikang Li, Shengjin Wang, and Jian Sun. Perceive where to focus: Learning visibility-aware part-level features for partial person re-identification. In CVPR, pages 393–402, 2019.
[32] Chiat-Pin Tay, Sharmili Roy, and Kim-Hui Yap. Aanet: Attribute attention network for person re-identifications. In CVPR, pages 7134–7143, 2019.
[33] Guangcong Wang, Jianhuang Lai, Peigen Huang, and Xiaohua Xie. Spatial-temporal person re-identification. In AAAI, volume 33, pages 8933–8940, 2019.
[34] Jingya Wang, Xiatian Zhu, Shaogang Gong, and Wei Li. Transferable joint attribute-identity deep learning for unsupervised person re-identification. In CVPR, pages 2275–2284, 2018.
[35] Yu-Xiong Wang and Martial Hebert. Learning by transferring from unsupervised universal sources. In AAAI, pages 2187–2193.
[36] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer gan to bridge domain gap for person re-identification. In CVPR, pages 79–88, 2018.
[37] Kilian Q Weinberger and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10(Feb):207–244, 2009.
[38] Ancong Wu, Wei-Shi Zheng, Xiaowei Guo, and Jian-Huang Lai. Distilled person re-identification: Towards a more scalable system. In CVPR, pages 1187–1196, 2019.
[39] Yu Wu, Yutian Lin, Xuanyi Dong, Yan Yan, Wanli Ouyang, and Yi Yang. Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning. In CVPR, pages 5177–5186, 2018.
[40] Tong Xiao, Hongsheng Li, Wanli Ouyang, and Xiaogang Wang. Learning deep feature representations with domain guided dropout for person re-identification. In CVPR, pages 1249–1258, 2016.
[41] Xiaomeng Xin, Jinjun Wang, Ruji Xie, Sanping Zhou, Wenli Huang, and Nanning Zheng. Semi-supervised person re-identification using multi-view clustering. Pattern Recognition, 88:285–297, 2019.
[42] Jun Yang, Rong Yan, and Alexander G Hauptmann. Cross-domain video concept detection using adaptive svms. In ACM MM, pages 188–197. ACM, 2007.
[43] Qize Yang, Hong-Xing Yu, Ancong Wu, and Wei-Shi Zheng. Patch-based discriminative feature learning for unsupervised person re-identification. In CVPR, pages 3633–3642, 2019.
[44] Wenjie Yang, Houjing Huang, Zhang Zhang, Xiaotang Chen, Kaiqi Huang, and Shu Zhang. Towards rich feature discovery with class activation maps augmentation for person re-identification. In CVPR, pages 1389–1398, 2019.
[45] Xun Yang, Meng Wang, and Dacheng Tao. Person re-identification with metric learning using privileged information. IEEE Transactions on Image Processing, 27(2):791–805, 2017.
[46] Xi Yin, Xiang Yu, Kihyuk Sohn, Xiaoming Liu, and Manmohan Chandraker. Feature transfer learning for face recognition with under-represented data. In CVPR, pages 5704–5713, 2019.
[47] Hong-Xing Yu, Ancong Wu, and Wei-Shi Zheng. Cross-view asymmetric metric learning for unsupervised person re-identification. In ICCV, pages 994–1002, 2017.
[48] Hong-Xing Yu, Wei-Shi Zheng, Ancong Wu, Xiaowei Guo, Shaogang Gong, and Jian-Huang Lai. Unsupervised person re-identification by soft multilabel learning. In CVPR, pages 2148–2157, 2019.
[49] Amir R Zamir, Alexander Sax, William Shen, Leonidas J Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In CVPR, pages 3712–3722, 2018.
[50] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In ICCV, pages 1116–1124, 2015.
[51] Liang Zheng, Yi Yang, and Alexander G Hauptmann. Person re-identification: Past, present and future. arXiv preprint arXiv:1610.02984, 2016.
[52] Zhedong Zheng, Xiaodong Yang, Zhiding Yu, Liang Zheng, Yi Yang, and Jan Kautz. Joint discriminative and generative learning for person re-identification. In CVPR, pages 2138–2147, 2019.
[53] Sanping Zhou, Jinjun Wang, Jiayun Wang, Yihong Gong, and Nanning Zheng. Point to set similarity based deep feature learning for person re-identification. In CVPR, pages 3741–3750, 2017.