
886 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 14, NO. 4, APRIL 2019

Dictionary Alignment With Re-Ranking for Low-Resolution NIR-VIS Face Recognition

Sivaram Prasad Mudunuri, Shashanka Venkataramanan, and Soma Biswas

Abstract— Recently, near-infrared (NIR) images are increasingly being captured for recognizing faces in low-light/night-time conditions. Matching these images against the controlled high-resolution visible facial images usually present in the database is a challenging task. In surveillance scenarios, the NIR images can have very low resolution and also non-frontal pose, which makes the problem even more challenging. In this paper, we propose an orthogonal dictionary alignment approach for addressing this problem. We also propose a re-ranking approach to further improve the recognition performance for each probe by combining the rank list given by the proposed algorithm with that given by another complementary feature/algorithm. Finally, we have also collected our own database, Heterogeneous face recognition across Pose and Resolution (HPR), that has facial images captured from two surveillance-quality NIR cameras and one high-resolution visible camera, with significant variations in head pose and resolution. Extensive experiments on the modified CASIA NIR-VIS 2.0 database, the Surveillance Camera face database, and our HPR database show the effectiveness of the proposed approaches and the collected database.

Index Terms— Heterogeneous face recognition, re-ranking, low-resolution, head pose.

I. INTRODUCTION

To facilitate face recognition in low-light or night-time conditions, near-infrared (NIR) facial images are becoming quite common. In contrast, as most of the enrolled images in the database are controlled visible (VIS) images, this results in the heterogeneous face recognition problem. Several approaches have been proposed to address this task [36], [39], but most of them deal with good-resolution images. Recently, for security applications, more and more surveillance cameras are being installed at several places. The NIR images captured by these surveillance cameras usually have low resolution in addition to considerable variations in pose (Fig. 1). This makes the problem even more challenging, since now the low-resolution (LR), uncontrolled NIR probe images need to be matched against the high-resolution (HR) controlled VIS

Manuscript received March 11, 2018; revised June 6, 2018 and July 23, 2018; accepted August 13, 2018. Date of publication August 31, 2018; date of current version October 30, 2018. This work was partially supported by MeitY, India, through a research grant. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Andrew Beng Jin Teoh. (Sivaram Prasad Mudunuri and Shashanka Venkataramanan contributed equally to this work.) (Corresponding author: Soma Biswas.)

The authors are with the Department of Electrical Engineering, Indian Institute of Science, Bengaluru 560012, India (e-mail: [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIFS.2018.2868173

Fig. 1. Sample images from our HPR database. Columns 1-7: HR VIS (Row 1) and NIR faces (Row 2) at 7 different poses (pose00-pose06). Columns (1,2), (3,4), and (5,6,7) of Row 2 show images at 8 ft., 12 ft., and 16 ft. from the camera, respectively.

gallery images. But this scenario is relatively less explored in the literature [15], [41].

The contribution of this work is three-fold. First, we propose a dictionary alignment based approach to address this challenging problem. Since one-to-one correspondence between the HR VIS and LR NIR images may not be available during training, we propose to learn two orthogonal dictionaries for the two domains independently. In order to be able to match the coefficients computed from independently learnt dictionaries, we then learn the correspondence between the atoms of the two dictionaries. Finally, metric learning is used to make the coefficients discriminative, since the final objective is recognition. The proposed framework can also be useful for addressing other cross-domain matching problems like photo vs. sketch, RGB vs. thermal images, etc.

It has been observed that, given a probe and its retrieved results by any algorithm, the ranked scores and the information about the neighbors can potentially be utilized to further improve the retrieval performance. Our second contribution is a novel re-ranking algorithm, which takes the ranks (scores) given by the proposed approach and any other approach to further improve the recognition performance. We design our re-ranking approach by analyzing the strongly similar, dissimilar and neutral gallery images given by the two techniques. The proposed re-ranking approach is general and can help to improve the performance of any two baseline algorithms. Extensive experiments conducted on the modified CASIA NIR-VIS 2.0 database [23], the Surveillance Camera face (SCface) database [9] and the proposed HPR database demonstrate the usefulness of the proposed dictionary alignment approach and the re-ranking approach.

In order to advance the state-of-the-art in this area, we feel that a systematic analysis of the effect of variations in pose

1556-6013 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


MUDUNURI et al.: DICTIONARY ALIGNMENT WITH RE-RANKING FOR LOW-RESOLUTION NIR-VIS FACE RECOGNITION 887

and distance between the subject and the camera on the recognition performance will be useful. Most of the available datasets for this problem do not suit this purpose. In this work, to facilitate this study, we introduce a new database, termed HPR (Heterogeneous face recognition across Pose and Resolution), which is our third contribution. It consists of 200 subjects captured using two different NIR surveillance cameras and one HR VIS camera.

A preliminary version of this work appeared in [29]. The additional contributions of this paper as compared to its preliminary version are as follows: (1) First, a novel re-ranking approach is proposed in Section IV; (2) Second, we introduce a new database, HPR, in Section V. Three different protocols are also described, illustrating the effect of head pose and distance between the camera and the subject (Section VI-D); (3) Finally, extensive experiments are conducted on the modified CASIA NIR-VIS 2.0 database, the SCface database and our proposed database in Section VI to demonstrate the effectiveness of our proposed approach. Extensive analysis, including the effect of super-resolution, different resolutions, face hallucination and different score fusion techniques, is reported in Section VI-B.

II. RELATED WORK

In this section, we provide pointers to some related approaches in the literature. We also discuss some available databases for matching NIR with VIS face images later.

A. Re-Ranking Approaches

Several re-ranking approaches have been proposed for the application of person re-identification (re-ID) [51]. Zheng et al. [50] propose a late fusion method at the score level to address the task, while Leng et al. [21] propose utilizing context and content similarity information. Ye et al. [47] use both similarity and dissimilarity cues to develop an optimized ranking framework. A fusion based re-ranking method which exploits the diversity of high-dimensional feature vectors is addressed in [48]. Sarfraz et al. [38] incorporate fine and coarse pose information of a person to learn a discriminative embedding used for automatic re-ranking. Bai et al. [2] propose a framework to improve the performance of object retrieval with a diffusion process. A sparse contextual activation that encodes the local distribution of an image by considering the pairwise distance in contextual space is described in [1]. A re-ranking strategy which takes into account the relation of both probe and gallery with the reference image set is described in [30]. We utilize the concepts of strongly similar, dissimilar and neutral galleries from [8] and dual rank lists from [47] to design our own novel re-ranking framework for the LR heterogeneous face recognition problem.

B. Heterogeneous and Low-Resolution Face Recognition

Here, we discuss some recent works related to heterogeneous and low-resolution face recognition. Canonical Correlation Analysis [12] and its variants [25], [34] are proposed to match the samples across any two modalities. Generalized Multiview Analysis [40] solves a joint, relaxed quadratic constraint to obtain a common nonlinear discriminative subspace. A compact binary face descriptor that computes the descriptors from the pixel difference vectors in each local patch of the image is described in [6] and [27]. Wu et al. [45] describe a light CNN framework which learns a compact embedding on large-scale face data with massive noisy labels. This work does not need one-to-one paired data. An extensive survey on heterogeneous face recognition is presented in [32]. Reale et al. [35] propose three different methods to improve heterogeneous face recognition, involving removal of extra information from pre-trained networks, utilizing an altered contrastive loss function and improving the initialization network. A deep multi-task learning approach to jointly estimate multiple heterogeneous face attributes from a single image is discussed in [10]. Modality invariant deep features are learned by orthogonal subspace decomposition and maxout operations at high-level representations of CNNs in [11]. Lezama et al. [22] use cross-spectral hallucination and low-rank embedding for this task. A coupled deep learning framework for heterogeneous face recognition involving trace norm and block diagonal constraints is proposed in [46]. Klare et al. [18] propose to match cross-modal faces by computing kernel similarities with a collection of prototype faces. A framework to match NIR and VIS faces using transduction is described in [52]. A single hidden-layer Gabor-based network is proposed in [31] to recognize cross-modal faces.

We now discuss some recent works on LR face recognition. Hu et al. [13] propose to train CNNs with the available small number of face samples by generating synthetic images. Roy and Bhattacharjee [37] propose an illumination invariant feature describing the effect of the center pixel on its neighborhood for heterogeneous face recognition. Jin et al. [14] propose a coupled feature learning method to simultaneously exploit discriminative information and to reduce the appearance difference. Chu et al. [5] investigate LR face recognition with a single sample per person and propose a cluster based regularized simultaneous discriminant analysis method.

III. DALIGN: PROPOSED DICTIONARY ALIGNMENT APPROACH

In this section, we present our proposed framework containing the orthogonal dictionary alignment method and the re-ranking approach for matching data points from two different domains. Let X = {x_1, x_2, ..., x_{N_1}} ∈ R^{d×N_1} and Y = {y_1, y_2, ..., y_{N_2}} ∈ R^{d×N_2} be the data in the two domains, where N_1 may not be equal to N_2. Our approach does not require one-to-one paired data for training, unlike other coupled dictionary learning approaches [44]. The training stage has the following steps: 1) Learn separate orthogonal dictionaries for the two domains and compute the correspondence of the dictionary atoms; 2) Align the dictionaries to minimize the shift between the two domains and make the sparse coefficients discriminative for the recognition task. We now describe each step in detail below.

A. Orthogonal Dictionary Learning and Correspondence

In this work, we propose to learn domain-specific orthogonal dictionaries which do not require one-to-one paired data, thereby reflecting practical scenarios. In our work, during training, data in the two domains (VIS and NIR) are used to learn orthogonal dictionaries of their corresponding domains independently; thus the input data need not be paired. There are several advantages of learning orthogonal as compared to over-complete dictionaries, as explained in [3]. In addition, another advantage of our approach is that since the dictionaries span corresponding spaces in the two domains and are orthogonal, the dictionary atoms can be aligned to minimize the shift between the two domains. The sparse coefficients computed from the aligned dictionaries can then be directly compared.

In our work, we use the orthogonal dictionary learning framework proposed by Bao et al. [3]. Given the data from one domain X, we learn an orthogonal dictionary D̄_x = [A_x, D_x] such that D̄_x^T D̄_x = I_{d×d}. Here, d is the size of the input feature vector. The optimization function that learns the orthogonal dictionary is formulated as follows:

\[
\min_{D_x, \Lambda} \; \| X - [A_x, D_x]\,\Lambda \|_2^2 + \alpha \| \Lambda \|_0
\quad \text{s.t.} \quad D_x^T D_x = I_m, \; A_x^T D_x = 0. \tag{1}
\]

Here m is the number of atoms in the dictionary D_x. The dictionary D̄_x has two sub-dictionaries, where A_x can be used to control the required number of orthogonal dictionary atoms, and D_x represents the orthogonal dictionary atoms learned from the input data X. We have set A_x to a null matrix so that D̄_x = D_x, and initialized D_x to a DCT matrix of size d × d. During the t-th iteration, we have D_x^t, so the sparse vectors Λ_x^t can be updated as follows:

\[
\hat{\Lambda}_x^t = \arg\min_{\Lambda_x^t} \; \| X - [A_x, D_x^t]\,\Lambda_x^t \|_2^2 + \alpha \| \Lambda_x^t \|_0
\quad \text{s.t.} \quad (D_x^t)^T D_x^t = I_m, \; A_x^T D_x^t = 0. \tag{2}
\]

The problem in (2) has a unique solution given by [3]

\[
\hat{\Lambda}_x^t = T_\varepsilon\big( (\bar{D}_x^t)^T X \big). \tag{3}
\]

Here T_ε(v) represents a hard-thresholding operation on the vector v. The objective function to update the dictionary at the t-th iteration can be formulated as

\[
\hat{D}_x^t = \arg\min_{D_x^t} \; \| X - [A_x, D_x^t]\,\Lambda_x^{t-1} \|_2^2
\quad \text{s.t.} \quad (D_x^t)^T D_x^t = I, \; A_x^T D_x^t = 0. \tag{4}
\]

The above objective function has a unique solution, given by D̂_x^t = U V^T. The matrices U and V are orthogonal matrices obtained from the Singular Value Decomposition (I_d − P_A) X Λ_{x,D}^T = U Σ V^T. Here, Λ_{x,D} is the sparse coefficient matrix at iteration (t − 1) associated with the dictionary D_x, and P_A is an orthogonal projection operator defined by P_A u = A(A^T u), ∀ u ∈ R^d. Using the above formulation, we separately compute two dictionaries D_x and D_y for the data X and Y from the two domains.
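The alternating updates above can be sketched in a few lines of NumPy. This is an illustrative implementation under the paper's simplification A_x = 0 (so D̄_x = D_x is a full d × d orthogonal matrix and (I_d − P_A) = I); the random orthogonal initialization stands in for the DCT matrix, and the hyperparameter defaults are our own, not the paper's.

```python
import numpy as np

def learn_orthogonal_dictionary(X, alpha=0.1, n_iter=20, seed=0):
    """Sketch of the alternating updates of Bao et al. [3] with A_x = 0.

    Sparse-coding step (Eq. 3): Lambda = T_eps(D^T X), a hard threshold.
    Dictionary step (Eq. 4):    D = U V^T from the SVD of X Lambda^T.
    """
    d = X.shape[0]
    rng = np.random.default_rng(seed)
    # Random orthogonal init; the paper initializes with a DCT matrix.
    D, _ = np.linalg.qr(rng.standard_normal((d, d)))
    eps = np.sqrt(alpha)  # threshold level implied by the l0 penalty
    Lam = np.zeros_like(X)
    for _ in range(n_iter):
        Lam = D.T @ X
        Lam[np.abs(Lam) < eps] = 0.0           # T_eps(.)
        U, _, Vt = np.linalg.svd(X @ Lam.T)    # (I - P_A) = I here
        D = U @ Vt                             # closest orthogonal matrix
    return D, Lam
```

Because D is updated as the product of the SVD's orthogonal factors, the constraint D^T D = I holds exactly after every iteration.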

Since both the dictionaries have m orthogonal atoms computed from data of the same classes (in our case, m = d since we set A_x to a null matrix), they should span the same space. But, since they are learnt separately, there may not be any correspondence between the dictionary atoms of D_x and D_y, i.e., the i-th dictionary atom of D_x may not correspond to the i-th atom of D_y. So now the goal is to find the corresponding dictionary atoms, so that they can be aligned. We therefore transform the columns of both the dictionaries to a common space using cluster CCA [34] learned from the training data. Cluster CCA is used since it only requires correspondence between clusters of data in the two domains and not one-to-one correspondence between the data points. Then Bipartite Graph Matching [4] is used to learn the correspondence of the dictionary atoms in the transformed space. Given a set of m dictionary atoms from D_x (denoted as d_x^i, i = 1, ..., m) and m dictionary atoms from D_y (denoted as d_y^j, j = 1, ..., m), the goal is to establish the correspondence between them. Let C_{ij} be the cost of matching the two vectors d_x^i and d_y^j. The objective function of the bipartite graph matching is as follows:

\[
H(\pi) = \sum_i C\big( d_y^i, \, d_x^{\pi(i)} \big). \tag{5}
\]

The above objective function is minimized, which gives the correspondence between the dictionary atoms of the two domains. π(i) is essentially a permutation operation, which is used to permute the atoms of one dictionary such that the dictionary atoms of the two domains have one-to-one correspondence. Note that the dictionaries are transformed to the common space only to apply bipartite graph matching. Once the correspondence matching π(i) is obtained, we apply the permutation on the original dictionary D_x. Let these dictionaries be denoted by D_x^c and D_y^c, where D_x^c is obtained by permuting the atoms of D_x according to π(i), and D_y^c = D_y.
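The bipartite matching of Eq. (5) can be sketched with SciPy's Hungarian solver. This is an assumption-laden toy: the cost used here is the squared Euclidean distance between atoms, whereas the paper first maps both dictionaries into a cluster-CCA space (omitted in this sketch), and `match_atoms` is our own name.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_atoms(Dx, Dy):
    """Minimize H(pi) = sum_i C(d_y^i, d_x^pi(i)) over permutations pi
    (Eq. 5) with the Hungarian algorithm. Dx and Dy are d x m matrices
    whose columns are dictionary atoms."""
    # C[i, j] = || d_y^i - d_x^j ||^2
    C = ((Dy[:, :, None] - Dx[:, None, :]) ** 2).sum(axis=0)
    rows, cols = linear_sum_assignment(C)
    return cols[np.argsort(rows)]  # pi: atom pi[i] of Dx matches atom i of Dy

# Permuting Dx by pi then yields the corresponded dictionary:
# Dx_c = Dx[:, pi]
```

When D_y is an exact column permutation of D_x, the minimizer of the total cost recovers that permutation.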

B. Dictionary Alignment, Discriminative Coefficients

Once we compute the permuted dictionaries D_x^c and D_y^c, which have one-to-one correspondence between their columns, we need to address the domain shift of the dictionaries. We reduce this domain shift by aligning one of the dictionaries with respect to the other, so that we can have a common sparse representation for cross-domain data. Inspired by the success of the subspace alignment approach [7], we learn a mapping function T on D_x^c so that T : D_x^c → D_y^c. We formulate the objective function to learn the mapping function T as below:

\[
\hat{T} = \arg\min_T \; \| D_x^c\, T - D_y^c \|_2^2. \tag{6}
\]

By solving the above objective function, the optimum value of T can be derived in closed form as T̂ = (D_x^c)^T D_y^c. Thus, the x-domain dictionary aligned to the y-domain is given by D_x^{c,a} = D_x^c (D_x^c)^T D_y^c.

Though the dictionaries are aligned to reduce the domain shift, the corresponding sparse coefficient vectors need to be made discriminative so that they can be used for the recognition task. The data from the X and Y domains are first used to compute their sparse coefficients as follows:

\[
\arg\min_{\Lambda_x} \; \| X - D_x^{c,a}\, \Lambda_x \|_2^2 + \alpha \| \Lambda_x \|_0, \qquad
\arg\min_{\Lambda_y} \; \| Y - D_y^c\, \Lambda_y \|_2^2 + \alpha \| \Lambda_y \|_0. \tag{7}
\]
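The closed form of Eq. (6) and the coding step of Eq. (7) can be sketched as follows. This is an illustrative sketch, not the paper's code: `align_and_code` is our own name, and it uses analysis coding with a hard threshold (as in Eq. (3)), which is valid because the dictionaries here are full orthogonal matrices; in that m = d case, D_x^c (D_x^c)^T = I, so the aligned dictionary coincides with D_y^c.

```python
import numpy as np

def align_and_code(X, Y, Dcx, Dcy, alpha=0.1):
    """Sketch of Eqs. (6)-(7): closed-form dictionary alignment, then
    hard-threshold sparse coding in both domains."""
    T = Dcx.T @ Dcy              # closed-form minimizer of ||Dcx T - Dcy||^2
    Dcx_a = Dcx @ T              # aligned x-domain dictionary D_x^{c,a}
    eps = np.sqrt(alpha)
    Lx = Dcx_a.T @ X             # analysis coding with an orthogonal dictionary
    Lx[np.abs(Lx) < eps] = 0.0
    Ly = Dcy.T @ Y
    Ly[np.abs(Ly) < eps] = 0.0
    return Dcx_a, Lx, Ly
```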


Fig. 2. Illustration of the proposed re-ranking framework.

To make the sparse coefficients discriminative, we apply the metric learning approach KISSME [19] on the sparse coefficients of the two domains so that the coefficient vectors of the same class move closer and those of different classes move apart. Please refer to [19] for further details of the algorithm.
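For reference, the core rule of KISSME can be sketched as follows: the learned Mahalanobis matrix is the difference of the inverse covariances of pairwise feature differences over same-class and different-class pairs. The helper names and the ridge regularizer below are our own additions for numerical stability, not details from [19] or this paper.

```python
import numpy as np

def kissme_metric(diff_same, diff_diff, reg=1e-6):
    """Minimal KISSME sketch: M = inv(Cov_same) - inv(Cov_diff).
    diff_same / diff_diff are (n, d) arrays of pairwise feature
    differences for same-class and different-class pairs."""
    d = diff_same.shape[1]
    cov_s = diff_same.T @ diff_same / len(diff_same) + reg * np.eye(d)
    cov_d = diff_diff.T @ diff_diff / len(diff_diff) + reg * np.eye(d)
    return np.linalg.inv(cov_s) - np.linalg.inv(cov_d)

def kissme_distance(u, v, M):
    """Mahalanobis-style distance (u - v)^T M (u - v) under the metric."""
    diff = u - v
    return float(diff @ M @ diff)
```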

IV. PROPOSED RE-RANKING APPROACH AND TESTING

During the testing stage, the probe and the gallery features, represented by the discriminative sparse coefficient vectors, are computed, with which a rank list is obtained. The rank list gives the gallery data with increasing distance from the probe. The proposed re-ranking algorithm, illustrated in Fig. 2, uses this rank list along with that given by any other approach to compute an improved rank list, which is then used to predict the probe class. For discussion, we will refer to the proposed DAlign as Algo 1 and the other approach as Algo 2. The proposed re-ranking algorithm is general and can be used with any two algorithms.

Consider that the probe is denoted as p and the ranked lists of the gallery retrieved from Algo 1 and Algo 2 are denoted as RL1 = {g_1^1, g_2^1, g_3^1, ..., g_N^1} ∈ R^{d×N} and RL2 = {g_1^2, g_2^2, g_3^2, ..., g_N^2} ∈ R^{d×N}, respectively. Here, d is the feature dimension and N is the total number of gallery images. The rank list RLt (t ∈ {1, 2}) satisfies the condition d(p, g_1^t) < d(p, g_2^t) < ... < d(p, g_N^t), where d(p, g_i^t) (i ∈ {1, 2, 3, ..., N}) denotes the Euclidean distance between the probe p and the i-th gallery data. It is important to note that, in general, g_i^1 may or may not be the same as g_i^2. Using the probe to retrieve the gallery elements is referred to as a forward query.

The proposed re-ranking approach has two main steps. First, given a probe, based on the arrangement of the gallery data in the two rank lists, we divide them into three sets, namely strongly similar, strongly neutral and strongly dissimilar with respect to the given probe. Second, using the galleries in the strongly similar set, we refine the Algo 2 rank list using backward requery. This refined rank list is fused with that of Algo 1 using a weighted sum to compute an improved rank list, which is finally used to predict the probe class.

A. Strongly Similar, Dissimilar and Neutral Gallery

Now, we discuss the procedure to compute the strongly similar, strongly neutral and strongly dissimilar gallery sets.

1) Strongly Similar Score: We assume that for both the approaches, even if the correct match is not at rank-1, it will be in the top k retrieved results in RL1 and RL2. This is true for most algorithms and is assumed to be true by most re-ranking approaches [21], [24]. For example, for DAlign, even though the rank-1 accuracy is 58.38% on the modified CASIA NIR-VIS 2.0 database, it improves to 80.88% for rank-5 and 87.52% for rank-10, which is further discussed in Section VI. Thus, if a gallery is present in the top k retrieved results in both the rank lists, i.e., it lies in the intersection of the top k elements of the two rank lists, it is likely to be the correct match for the particular probe and is considered an element of the strongly similar gallery set. Let S_{k+}^1(p) and S_{k+}^2(p) denote the top k gallery elements for the given probe p in RL1 and RL2 respectively; then the strongly similar gallery set is given by

\[
G_+(p) = S_{k+}^1(p) \cap S_{k+}^2(p). \tag{8}
\]

Here, G_+(p) = {g_+^1, g_+^2, g_+^3, ...} denotes the strongly similar gallery elements, which is likely to contain the correct match for the particular probe and is thus treated as the new gallery set from this point onwards. Now, each element g_+^i ∈ G_+(p) of this new gallery is used as a pseudo-probe after replacing this gallery element with the original probe, which is known as backward requery, inspired from [49]. For example, when we use g_+^1 as the pseudo-probe, the gallery consists of {g_+^2, g_+^3, ..., p}, i.e., all the elements of G_+(p) without g_+^1, in addition to the original probe. This is based on the assumption that if g_+^i is the correct gallery for the probe p, then p should also show up in the top few retrieved elements when g_+^i is used as the pseudo-probe for requery. This assumption has been used in the re-ranking task with great success in [47]. This process is similarly repeated for all other g_+^i ∈ G_+(p).

The modified similarity between g_+^i and the original probe p is computed from the rank (or position) in the two scenarios (forward query and backward requery) as follows:

\[
Sim_{Sc}^i = \frac{\gamma}{R(g_+^i, p) \times R(p, g_+^i)}. \tag{9}
\]

Here, R(p, g_+^i) denotes the rank (or position) of g_+^i during the forward query, and R(g_+^i, p) denotes the rank of the probe p during the backward requery. For the correct gallery, both these ranks should be low, which results in a high similarity score of the gallery for that probe. We experimentally set k to 40 and the weighing factor γ to 0.001.

2) Strongly Dissimilar Score: The strongly dissimilar gallery score is used to penalize the gallery elements in the new gallery set G_+(p) which we believe are not a suitable match for the probe. The score assigned to the dissimilar gallery elements in this set pushes them away from the probe to output an improved rank list. The strongly dissimilar gallery set is obtained using the intersection of the last k gallery elements of RL1 and RL2. If S_{k−}^1(p) and S_{k−}^2(p) denote the last k gallery elements in RL1 and RL2 respectively, then the strongly dissimilar gallery set is formulated as

\[
G_-(p) = S_{k-}^1(p) \cap S_{k-}^2(p). \tag{10}
\]

Here G_-(p) = {g_-^1, g_-^2, g_-^3, ...} denotes the strongly dissimilar gallery elements, which are used to penalize the strongly similar gallery elements g_+^i ∈ G_+(p) which are not likely to be a correct match for the given probe.

We assume that if a gallery g_+^i is the correct match for the probe p, then the number of common dissimilar elements when probe p is used for the forward query and when g_+^i is used for the backward requery will be high. When probe p is used for the forward query, instead of taking the last k retrieved elements as the dissimilar elements, we take the strongly dissimilar set, since these are more likely to be the correct dissimilar elements. So, we requery g_+^i back into the rank list obtained by Algo 1, and let the last k gallery elements be denoted as G'_-(p). We then find the number of common elements between G'_-(p) and G_-(p), which is used to assign the dissimilarity score to g_+^i ∈ G_+(p) as follows:

\[
Dissim_{Sc}^i = \alpha_1 \cdot e^{-\,\left| G_-(p) \,\cap\, G'_-(p) \right|} \tag{11}
\]

where |·| denotes cardinality and α_1 represents the weighing factor, which we experimentally set to 1.0. A higher number of common dissimilar elements indicates that the gallery is likely to be a correct match for the probe, and so the dissimilarity score is smaller, and vice versa.
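Eq. (11) reduces to an overlap count followed by an exponential decay; a minimal sketch (function name ours):

```python
import math

def dissim_score(G_minus, G_prime_minus, alpha1=1.0):
    """Eq. (11): alpha1 * exp(-|G-(p) ∩ G'-(p)|). A large overlap of
    dissimilar galleries between the forward query and the backward
    requery means a small penalty for the candidate gallery."""
    overlap = len(set(G_minus) & set(G_prime_minus))
    return alpha1 * math.exp(-overlap)
```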

3) Strongly Neutral Score: The strongly neutral gallery elements are computed so as to lightly penalize the gallery elements in G_+(p) which we believe are neither strongly similar nor dissimilar to the probe. We obtain these elements using the intersection of the intermediate k gallery elements of RL1 and RL2. Using the same assumption and the same process as we followed for the strongly dissimilar score, we assign the neutral score for g_+^i, denoted as Neut_{Sc}^i.

B. Final Rank List Using Fuzzy Fusion

Using these computed scores, we refine the rank list obtained from Algo 2. First, we use a fuzzy aggregation operation to fuse these scores for gi+, resulting in a distance score [48]

fuse_Sc^i = 1 − ((Sim_Sc^i)^α + (Neut_Sc^i)^α + (Dissim_Sc^i)^α)^{1/α} (12)

Fuzzy aggregation is an averaging function which is an effective tool to fuse multiple scores. By varying α, this operator can be converted into the arithmetic mean or the geometric mean. Here, all scores are given equal weights, and we experimentally set α = 1.9. Thus, each element in the strongly similar gallery set is assigned this distance score (fuse_Sc), which we aggregate with its corresponding distance score d(p, g2) from RL2, since our goal is to refine the rank list of Algo 2. This aggregation gives a modified distance score for these gallery elements in the strongly similar gallery set, as shown below

ref_Sc = ((fuse_Sc)^α + d(p, g2)^α)^{1/α} (13)

Thus, all g+ ∈ RL2 are updated with ref_Sc, while the scores of the gallery elements which do not belong to the strongly similar gallery set remain unchanged. This results in a refined rank list for Algo 2, which is denoted as RLnew. Finally, for each gallery element, the distance scores from RLnew and RL1 are combined using a weighted sum as follows:

final_Sc = λ · d(p, g1) + (1 − λ) · ref_Sc (14)

Here, λ is the weight factor, which we experimentally set to 0.7. This distance score final_Sc is used to generate the rank list RL_final, which is used to predict the probe class.
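The fusion steps in Eqs. (12)-(14) can be sketched as follows; the function names are our own, and the default parameter values are the experimentally chosen ones from the text.

```python
def fuzzy_fuse(sim, neut, dissim, alpha=1.9):
    """Eq. (12): fuzzy aggregation of the three scores into one distance."""
    return 1.0 - (sim**alpha + neut**alpha + dissim**alpha) ** (1.0 / alpha)

def refine(fuse_sc, d_pg2, alpha=1.9):
    """Eq. (13): aggregate the fused score with the Algo 2 distance d(p, g2)."""
    return (fuse_sc**alpha + d_pg2**alpha) ** (1.0 / alpha)

def final_score(d_pg1, ref_sc, lam=0.7):
    """Eq. (14): weighted sum of the Algo 1 distance d(p, g1) and ref_Sc."""
    return lam * d_pg1 + (1.0 - lam) * ref_sc
```

With α = 1.9, the aggregation behaves like a generalized mean of the scores; setting λ = 0.7 in final_score reproduces the weighting used to build RL_final.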

V. DATABASE DESCRIPTION

In this section, we describe the different databases available which address the challenging problem of matching NIR with VIS faces, along with the details of our collected database.

A. Available Databases for Heterogeneous Face Recognition

Li et al. [23] collected the CASIA NIR-VIS 2.0 database containing 725 subjects captured under controlled indoor settings, with 1-22 VIS and 5-50 NIR high-resolution face images per subject. Since it contains a large number of subjects, it is one of the most widely used datasets for heterogeneous face recognition. But it does not contain surveillance quality images having low resolution and large variations in pose. Grgic et al. [9] collected the Surveillance Cameras face database (SCface), which contains images taken in an uncontrolled indoor scenario using three different NIR cameras. There are a total of 4,160 static images from both the NIR and VIS spectrum of 130 subjects. Though this dataset is very useful for addressing the surveillance quality scenario, the number of subjects is not very high, and it is also difficult to systematically study the effect of low resolution and pose on the recognition performance. Kang et al. [15] propose the Long Distance Heterogeneous Face Database (LDHF-DB) comprising both VIS and NIR face images. It contains 100 subjects, where for each subject, one frontal image without glasses is captured at each distance during day and night time. In practical scenarios, we encounter non-frontal images and variations due to people wearing glasses of different styles, which adds to the challenge


MUDUNURI et al.: DICTIONARY ALIGNMENT WITH RE-RANKING FOR LOW-RESOLUTION NIR-VIS FACE RECOGNITION 891

TABLE I

SUMMARY OF THE AVAILABLE DATABASES ALONG WITH THE PROPOSED HPR DATABASE FOR MATCHING NIR AND VIS FACE IMAGES

Fig. 3. Sample video frames from one video from our HPR database captured by NIR camera.

in matching NIR faces with VIS faces. Maeng et al. [28] propose the Near-Infrared Face Recognition at a Distance Database (NFRAD-DB) consisting of NIR images of 50 subjects. In addition to the small number of subjects, the images have only subtle variations in pose. Singh et al. [41] introduce a cross-spectral, cross-resolution video database with 160 subjects consisting of videos captured under uncontrolled settings for both the NIR and VIS spectrum. Bourlai et al. [42] introduce the SWIR (short-wave infrared), MWIR (mid-wave infrared) and NIR Mid-Range datasets. These three datasets are captured with 50, 50 and 103 subjects respectively, comprising images with frontal poses. A summary of these databases is given in Table I.

B. Proposed HPR Database

Taking into account the advantages and limitations of the available databases, we introduce a new database termed HPR, keeping two goals in mind: (1) it should reflect the practical surveillance scenario, and (2) it should help in the systematic study of the effect of resolution and pose on the recognition performance. The database has images from 200 subjects captured with two surveillance quality cameras which can work under low light conditions in NIR mode, and one HR VIS camera which captures mugshot photos.

C. Data Collection Details

For data collection, we use two NIR cameras, namely, a CP Plus PTZ camera with 2 MP resolution (NIR Cam01) and a HIKVision bullet camera with 3 MP resolution (NIR Cam02). A 54-array NIR LED illuminator is mounted on the same platform in between the two NIR cameras to provide an active lighting source in the dark. A Canon PowerShot Pro Series S5 IS 8.0 MP digital camera is used for capturing the HR VIS images. The VIS images are captured under normal lighting conditions, but the NIR images are captured by switching off all the light sources, with the NIR illuminator on. We take the HR VIS mugshot gallery images of subjects at 4 ft. distance from the

camera at seven poses (Fig. 1), which accounts for a total of 1400 HR VIS images. The NIR still images are captured with the subjects standing at three different distances (8 ft., 12 ft. and 16 ft.) from the two NIR cameras at seven different poses, to systematically study the effect of distance and pose on the performance, thus accounting for a total of 4200 images from each camera. Fig. 1 shows images taken from the first camera for different poses and distances. The resolution of the face images decreases as the subject moves further from the camera, allowing us to study the effect of varying resolution on the performance. For the NIR videos, the subjects perform different tasks like drinking water, reading a book, reaching out for something from a cupboard and looking at a watch. They are not directed to look at specific locations and they are free to do the task as they want, which results in significant variations in the videos, thus reflecting practical scenarios (Fig. 3). Most of the volunteers for our database are university students, and the data capture protocol was approved by the Institute Human Ethics Committee.

Once the images are captured, the Viola-Jones face detection technique [43] is used to detect and crop the face region in each image, and the crops are then manually verified and corrected. The summary of the proposed database is presented in Table I. The database has the following interesting points:

• It has real LR NIR images and HR VIS images of 200 subjects to address the problem of low-resolution heterogeneous face recognition with extreme poses, which are very difficult to recognize.

• The database is useful for systematic evaluation and understanding of how variations in pose and resolution affect the performance of heterogeneous face recognition.

• It also has videos of 200 subjects captured while they are performing various tasks.

• Experimental protocols are defined for our database and are benchmarked with available methods in the literature.


892 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 14, NO. 4, APRIL 2019

Fig. 4. Sample images from the modified CASIA NIR-VIS 2.0 database. Row 1: HR VIS gallery images; Row 2: corresponding LR NIR images.

VI. EXPERIMENTAL EVALUATION

Extensive experiments are conducted to evaluate the usefulness of the proposed algorithms for the task of low-resolution heterogeneous face recognition, along with analysis on the collected dataset. We first evaluate the performance of matching synthesized LR NIR images with HR VIS faces using the modified CASIA NIR-VIS 2.0 database [23]. Then, we evaluate the performance of the algorithms on real LR images of the SCface database [9] and our database. The proposed re-ranking approach is evaluated by combining DAlign with the other algorithms to justify its usefulness. In our experiments, we concatenate the pool5, fc6 and fc7 features from the VGG face network [33] and apply PCA to reduce the dimension to 2500, which is used as the feature vector [29].
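A minimal sketch of this feature pipeline, assuming toy dimensions: the paper concatenates VGG pool5/fc6/fc7 activations and reduces to 2500-d, while here random stand-ins are reduced to 25-d with a plain numpy PCA via SVD.

```python
import numpy as np

def pca_reduce(feats, dim):
    """Project row-vector features onto the top `dim` principal directions."""
    centered = feats - feats.mean(axis=0)
    # right singular vectors of the centered data are the principal axes
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T

# random stand-ins for pool5 / fc6 / fc7 activations of n face images
rng = np.random.default_rng(0)
n = 40
pool5 = rng.normal(size=(n, 512))
fc6 = rng.normal(size=(n, 256))
fc7 = rng.normal(size=(n, 256))
concat = np.hstack([pool5, fc6, fc7])  # (n, 1024) concatenated descriptor
reduced = pca_reduce(concat, dim=25)   # compact per-image feature vectors
print(reduced.shape)                   # (40, 25)
```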

A. Experiments on Modified CASIA NIR-VIS 2.0 Database

We choose the CASIA NIR-VIS 2.0 database [23] since it is the largest publicly available database, with 725 subjects. The HR VIS images are of size 128×128, while the NIR images are downsampled to 20×20 and then upsampled back to the original size using bicubic interpolation to simulate the LR images. Sample images of the database used in our experiments are shown in Fig. 4. We follow the standard protocol of the database and report the mean recognition rate and standard deviation over 10 folds for the proposed dictionary alignment (DAlign) approach. The comparisons with other approaches are reported in Table II. We observe that the performance of DAlign is 58.38%, which is higher than all the other compared approaches. But this is considerably lower than the performance of 87.45% when the probe images were HR, as reported in [29], which illustrates the difficulty of the problem addressed in this work.
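The LR simulation protocol above (downsample to 20×20, then upsample back to 128×128) can be sketched as below. For self-containment we use a hand-written bilinear resize as a stand-in for the bicubic interpolation used in the paper.

```python
import numpy as np

def bilinear_resize(img, h, w):
    """Minimal bilinear resize for a 2-D grayscale array (stand-in for bicubic)."""
    H, W = img.shape
    ys = np.linspace(0, H - 1, h)
    xs = np.linspace(0, W - 1, w)
    y0 = np.floor(ys).astype(int); x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, H - 1); x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

hr = np.random.default_rng(1).random((128, 128))  # toy 128x128 face crop
lr = bilinear_resize(hr, 20, 20)       # downsample to simulate a 20x20 probe
probe = bilinear_resize(lr, 128, 128)  # upsample back to the gallery size
print(probe.shape)                     # (128, 128)
```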

We also evaluate the proposed re-ranking approach using DAlign along with the other compared approaches taken as Algo 2, whose results are reported in Table II. We observe that the proposed re-ranking framework is able to consistently outperform both DAlign and the other baseline. For this experiment, the best rank-1 performance of 68.30% is obtained by applying re-ranking on DAlign and CBFD. We compare our approach with recent state-of-the-art re-ranking algorithms [2], [30], [38], and the results are reported in Table II.

During testing, for a given probe feature, it takes 0.1022 and 0.5331 seconds to identify the probe ID using DAlign and CBFD respectively from 358 gallery images. Given their scores, the re-ranking takes 1.5294 seconds to output the probe class. This is the CPU computational time on a machine with

TABLE II

EVALUATION (%) ON THE MODIFIED CASIA NIR-VIS 2.0 DATABASE [23]. NOTATION: (A+B) DENOTES THE PERFORMANCE OF THE RE-RANKING APPROACH USING THE APPROACHES A AND B

32 GB DDR4 RAM, 2.4 GHz clock frequency and a 64-bit OS, using unoptimized MATLAB code.

B. Analysis of the Proposed Approach

Here, we analyze the proposed algorithms on fold 3 of the modified CASIA NIR-VIS 2.0 database and take the best performing methods from Table II. The rank-1 recognition rates on fold 3 for DAlign, randKCCA, CBFD and CA-LBFL are 57.56%, 45.60%, 42.80% and 46.29% respectively.

1) Comparison With Score Fusion Techniques: Here, we illustrate the effectiveness of the proposed re-ranking approach as compared to some of the standard score based fusion techniques. We use four fusion techniques, namely Weighted Sum, Max. Fusion [17], Min. Fusion [17] and Product Fusion [50], to fuse the similarity scores for comparison. We observe from Table III that the proposed re-ranking approach combines the scores effectively to give better performance than the standard fusion techniques.
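For reference, the four baseline fusion rules compared in Table III can be sketched on toy distance scores; the function below is our own illustration, not the paper's code.

```python
import numpy as np

def fuse_scores(s1, s2, method="weighted", w=0.5):
    """Standard score-level fusion baselines over two distance-score arrays."""
    s1, s2 = np.asarray(s1, float), np.asarray(s2, float)
    if method == "weighted":
        return w * s1 + (1 - w) * s2
    if method == "max":
        return np.maximum(s1, s2)
    if method == "min":
        return np.minimum(s1, s2)
    if method == "product":
        return s1 * s2
    raise ValueError(method)

d1 = [0.2, 0.8, 0.5]  # toy distances from one algorithm for three gallery items
d2 = [0.4, 0.6, 0.1]  # toy distances from a second algorithm
print(fuse_scores(d1, d2, "product"))  # [0.08 0.48 0.05]
```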

2) Effect of Different Terms: Here we analyze the effect of each of the following terms, namely, SimSc (S1), DissimSc (S2) and NeutSc (S3), on the performance of the proposed re-ranking framework, using DAlign with CBFD as the two baselines. We conduct an experiment on fold 3 of the modified CASIA NIR-VIS 2.0 database by adding one term at a time. We obtain a rank-1 performance of 65.99% using S1 alone, adding S2 improves it to 66.17%, and finally combining S3 along with S1 and S2 gives the highest performance of 67.80%. We observe that each term helps in improving the


TABLE III

COMPARISON OF THE PROPOSED RE-RANKING ALGORITHM (RANK-1 %) WITH STANDARD FUSION TECHNIQUES ON FOLD-3 OF THE MODIFIED CASIA NIR-VIS 2.0 DATABASE [23]

Fig. 5. Effect of applying SR [20] on the LR NIR faces before performing the recognition.

recognition performance, though the most significant improvement is given by the first term.

The proposed re-ranking approach is also able to effectively utilize multiple scores from more than one secondary algorithm to further improve the result. We perform an experiment by taking DAlign as Algo 1 and three other algorithms, randKCCA, CBFD and CA-LBFL, as secondary baseline algorithms. We observe that the rank-1 recognition rate improves to 70.30%, while the rank-5 recognition rate increases to 88.19% and the rank-10 recognition rate becomes 92.94%. But all the other results in the paper are with one secondary algorithm.

3) Effect of Super-Resolution: One standard way of matching LR with HR images is to first apply super-resolution (SR) on the LR image to synthesize the corresponding HR image and then perform the matching. For this task, we apply the state-of-the-art deep SR technique proposed by Lai et al. [20] on the LR images to synthesize the corresponding HR images. In that work, a deep Laplacian pyramid network based framework is proposed [20] to progressively reconstruct the sub-band residuals of the HR images. Since the trained models for scaling factors of 2 and 4 are provided, we use LR NIR input images of resolution 20×20, 30×30 and 60×60. For 30×30 and 60×60, we use the scaling factors of 4 and 2 respectively to upsample them back close to the original resolution. For the image resolution of 20×20, we apply the SR technique with a scaling factor of 4 to obtain an 80×80 SR image, which is then resized back to the original resolution of 128×128.

For this experiment, we chose DAlign and CBFD [27] as the baselines for the proposed re-ranking approach. We observe from Fig. 5 that the SR technique is able to improve the rank-1 performance of the individual baseline algorithms.

Fig. 6. A few examples of hallucinated faces on the CASIA NIR-VIS 2.0 database. Top row: output of SR [20] on LR images of size 112×112; second row: corresponding hallucinated images after applying CSH [22].

We also observe that the SR images help to further boost the performance of the proposed re-ranking approach. For comparison, when HR NIR images are used as probe and HR VIS images are used as gallery, we obtain rank-1 performances of 84.90%, 72.68% and 92.01% with DAlign, CBFD and DAlign with CBFD respectively.

4) Effect of Face Hallucination: Here, we compare the proposed approach with the state-of-the-art heterogeneous face recognition approach which uses cross-spectral hallucination and low-rank embedding [22], referred to as CSH. For this experiment, as in [22], we use the three-channel RGB and NIR images of the database [23], which are cropped to 224×224 and aligned using the fiducial locations as discussed in [16]. Here, we want to analyze two things: (1) how does our approach compare with the state-of-the-art deep neural network based technique [22], and (2) can the proposed re-ranking approach use this technique along with DAlign to give a performance which is better than both the approaches.

For the experiment, we downsample the original NIR images to create the LR NIR probe images of resolution 112×112. (We also tried with an image resolution of 20×20, but the hallucinated images obtained when the super-resolved images using a scaling factor of 8 were given as input were distorted, and the performance was quite low.) We then apply the state-of-the-art SR technique [20] with a scale factor of 2 to create the HR images, and then apply the cross-spectral hallucination method [22]. Sample LR NIR images and their corresponding hallucinated images are shown in Fig. 6. We observe that for the LR image resolution of 112×112, the hallucinated images look quite natural. Then the VGG Face network is used to extract 1024-d features (Feat 1) as discussed in [22], after which low-rank embedding is applied to obtain the recognition performance. We directly use the code provided by Lai et al. [20] and Lezama et al. [22]. We observe from


TABLE IV

RANK-1 RECOGNITION RATES (%) OBTAINED FOR TWO DIFFERENT FEATURES USING CSH [22] AND DALIGN

TABLE V

RANK-1 RECOGNITION (%) ON SCFACE DATABASE [9]

Table IV (Row 1) that the proposed DAlign approach gives a better rank-1 recognition performance compared to the state-of-the-art approach for LR heterogeneous face recognition. We also observe that the proposed re-ranking approach is able to combine the rank lists given by DAlign and CSH to further improve the recognition performance.

We repeat the same experiments with our feature (obtained by concatenating pool5, fc6 and fc7 followed by PCA), denoted by Feat 2, and the results are reported in Table IV (Row 2). We observe a similar pattern in the results, except that these results are significantly higher as compared to those using Feat 1.

C. Experiments on SCface Database

In this section, we present the experimental evaluation on the SCface database [9] to illustrate the effectiveness of the proposed approaches on surveillance quality heterogeneous face images. We randomly take 50 subjects for training and the remaining 80 subjects for testing. Both the VIS and NIR images are resized to 224×224 to compute the feature (Feat 2). There are images from two IR cameras captured at three distances. We conduct the experiment on images from both cameras together under each distance separately, to illustrate the effect of distance variations on the recognition performance. The frontal HR VIS images are taken as gallery and the NIR images from both the cameras are taken as probe. Here, the probe images are themselves of surveillance quality.

We observe that both CBFD [27] and CA-LBFL [6] perform poorly on these images. This may be because these approaches work on pixel difference vectors and the images are of poor quality. We also examine the performance of the other approaches reported in Table II, but all of them perform poorly (around 5%). So for the re-ranking algorithm, instead of taking two approaches, we explore whether different features with the same algorithm can be used as inputs to the re-ranking approach. We conduct experiments with the proposed DAlign using two different image features, Feat 2 and rootSIFT [26] (Feat 3). We consider DAlign with the Feat 2 and Feat 3 features as Algo 1 and Algo 2 respectively for the re-ranking approach. We report the rank-1 recognition accuracies (%) for the three distances in Table V. Interestingly,
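The rootSIFT transform (Feat 3) maps a SIFT descriptor through L1 normalization followed by an element-wise square root, so that Euclidean distance on the result approximates the Hellinger kernel. A minimal sketch on a toy descriptor:

```python
import numpy as np

def root_sift(desc, eps=1e-12):
    """rootSIFT: L1-normalise a SIFT descriptor, then take element-wise sqrt."""
    desc = np.asarray(desc, float)
    desc = desc / (np.abs(desc).sum() + eps)  # L1 normalisation
    return np.sqrt(desc)

d = root_sift(np.ones(128))                # uniform toy 128-d descriptor
print(np.isclose(np.linalg.norm(d), 1.0))  # rootSIFT vectors are L2-unit: True
```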

the performance of Algo 1 is less than that of Algo 2 for Dist 1, the same as Algo 2 for Dist 2, and better than Algo 2 for Dist 3. But the proposed re-ranking approach is able to successfully improve the recognition performance as compared to both Algo 1 and Algo 2 for matching LR NIR images with good quality VIS images.

D. Experiments on the Proposed HPR Database

In this section, we provide details of the different evaluation protocols and the results obtained by the different state-of-the-art algorithms on our database. We provide three challenging protocols by which we can systematically analyze the effect of variations in head pose and the distance between the camera and the subject (or resolution) on the recognition performance. In all our experiments, we consider the HR VIS faces taken under frontal pose as the gallery. Thus, both during training and testing, there is only a single image per subject for the VIS domain. Out of the 200 subjects, 50 subjects are taken for training and the remaining 150 subjects are considered for testing. Though we have landmark points for each facial image in our database, we did not use this information in our experiments. We use the rank-1 recognition rate (%) as the performance metric for all the experimental protocols on our database. For this dataset also, the best algorithms in Table II, namely CBFD [27] and CA-LBFL [6], performed poorly. So we use randKCCA as Algo 2 to demonstrate the performance of our re-ranking framework. In this work, all the experiments are on the still images.
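The rank-1 recognition rate used as the metric throughout can be computed from a probe-by-gallery distance matrix as sketched below (toy data, single nearest neighbor; the helper is our own illustration):

```python
import numpy as np

def rank1_accuracy(dist, probe_labels, gallery_labels):
    """Rank-1 rate: fraction of probes whose nearest gallery entry shares their label."""
    nearest = np.argmin(dist, axis=1)  # closest gallery index per probe
    return float(np.mean(np.asarray(gallery_labels)[nearest] == np.asarray(probe_labels)))

dist = np.array([[0.1, 0.9],   # probe 0 is closest to gallery 0
                 [0.8, 0.2],   # probe 1 is closest to gallery 1
                 [0.3, 0.4]])  # probe 2 is closest to gallery 0
print(rank1_accuracy(dist, probe_labels=[0, 1, 1], gallery_labels=[0, 1]))  # 2 of 3 correct
```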

1) Protocol 1 (Effect of Varying Resolution): This protocol is designed to analyze how the distance between the camera and the subject affects the recognition performance. For this protocol, all the poses captured at a particular distance by both the NIR cameras are considered as the probe images. The results of the proposed dictionary alignment approach and the proposed re-ranking approach (taking randKCCA as the other baseline) are shown in Table VI for different distances. We observe that the performance drops as the distance between the camera and the subject increases. For example, for the proposed re-ranking approach, the performance drops from 60.62% to 56.91% when the distance increases from 8 ft. to 12 ft., and further drops to 50.52% when the distance increases to 16 ft. This is because, as the subject moves far from the camera, the size of the face becomes smaller and hence less discriminative information is available in the face, thereby resulting in lower recognition performance.

2) Protocol 2 (Effect of Pose Variations): This protocol is designed to analyze the effect of head pose variations on the recognition performance. For a particular distance (say 8 ft.), we take one pose (say Pose00) captured from both the NIR cameras as probe images. The experiment is repeated for each distance and each pose to observe the effect of pose variations. We also consider an even more challenging setting, where we fix the pose and take all the images captured at all three distances together as probe. We repeat this setting over all the poses. Each of the first three rows in each block in Table VI shows the performance of the corresponding algorithm for the seven different head poses at a particular


TABLE VI

RANK-1 RECOGNITION (%) FOR THE THREE PROTOCOLS ON THE HPR DATABASE

distance. The fourth row shows the performance for all three distances combined for the different poses. As expected, the performance is best when the probe has a frontal pose (Pose00) and is captured from a nearby distance (8 ft.).

3) Protocol 3 (Effect of All the Poses and Distances Together): In real surveillance scenarios, there is usually no restriction on the pose and distance of the subjects from the camera. So the most realistic and also the most challenging protocol is when the probe consists of all seven poses and all three distances together. The results are reported in Table VI at “All Distance” vs. “All Poses”. We see that the performance of the dictionary alignment approach for this protocol is 50.16%. The proposed re-ranking approach is able to improve the performance to 56.46%. So we see that a lot of work is still required to achieve satisfactory performance on the low-resolution heterogeneous face recognition task.

4) Effect of SR on Real LR Faces: We have observed that SR techniques help to improve the performance for the synthesized LR NIR images. Inspired by this, we conduct another experiment to study the effect of SR techniques on real LR images. For this experiment, we take the NIR images at 16 ft. distance as the LR images and apply the SR technique [20] to reconstruct the HR images before performing the matching. We observe that there is almost no improvement in performance, which indicates that improving the matching performance of real LR images using SR techniques is much more challenging compared to that of synthetic LR images.

5) Effect of Real vs. Synthesized LR Images: We perform another experiment in which we synthesize LR images by downsampling the images at 8 ft. to the size of the images at 16 ft. and computing the recognition accuracy with our approach. We then compare its recognition performance with that on the real LR images at 16 ft. The rank-1 performances on the synthesized (real LR) images using DAlign, randKCCA and the re-ranking framework are 51.71% (44.49%), 35.19% (31.95%) and 56.46% (50.52%) respectively. Clearly, the synthesized images have significantly better recognition accuracy compared to the real LR images, which further illustrates the importance of working with real LR images as compared to synthetically generated LR images.

VII. CONCLUSION

In this work, we address the challenging problem of matching uncontrolled LR NIR images with the HR VIS images in the database, which will be very useful for low-light or night-time surveillance scenarios. We propose a dictionary alignment approach for this task. We also propose a re-ranking approach, which can effectively combine the dictionary alignment approach with another approach to further improve the performance. We further introduce our own HPR database to systematically study the effect of head pose and image resolution on the recognition performance. We observe from the experimental results that most of the existing algorithms, including the proposed one, are significantly affected by the non-frontal head pose of the probe image. One way to address this is to utilize approaches which frontalize faces along with the proposed approach for improved performance on non-frontal faces, and this will be part of our future work.

ACKNOWLEDGEMENT

The authors would like to thank Prof. K. R. Ramakrishnan (Retd. Professor, IISc) and Mr. Suresh Kirthi K. (IISc) for helpful discussions.

REFERENCES

[1] S. Bai and X. Bai, “Sparse contextual activation for efficient visual re-ranking,” IEEE Trans. Image Process., vol. 25, no. 3, pp. 1056–1069, Mar. 2016.

[2] S. Bai, X. Bai, Q. Tian, and L. J. Latecki, “Regularized diffusion process on bidirectional context for object retrieval,” IEEE Trans. Pattern Anal. Mach. Intell., to be published.

[3] C. Bao, J.-F. Cai, and H. Ji, “Fast sparsity-based orthogonal dictionary learning for image restoration,” in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 3384–3391.

[4] S. Belongie, J. Malik, and J. Puzicha, “Shape matching and object recognition using shape contexts,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 4, pp. 509–522, Apr. 2002.

[5] Y. Chu, T. Anand, G. Bebis, and L. Zhao, “Low-resolution face recognition with single sample per person,” Signal Process., vol. 141, pp. 144–157, Dec. 2017.

[6] Y. Duan, J. Lu, J. Feng, and J. Zhou, “Context-aware local binary feature learning for face recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 5, pp. 1139–1153, May 2018.

[7] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars, “Unsupervised visual domain adaptation using subspace alignment,” in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 2960–2967.


[8] J. García, N. Martinel, A. Gardel, I. Bravo, G. L. Foresti, andC. Micheloni, “Discriminant context information analysis for post-ranking person re-identification,” IEEE Trans. Image Process., vol. 26,no. 4, pp. 1650–1665, Apr. 2017.

[9] M. Grgic, K. Delac, and S. Grgic, “SCface—Surveillance cameras facedatabase,” Multimedia Tools Appl., vol. 51, no. 3, pp. 863–879, 2011.

[10] H. Han, A. K. Jain, S. Shan, and X. Chen, “Heterogeneous face attributeestimation: A deep multi-task learning approach,” IEEE Trans. PatternAnal. Mach. Intell., to be published.

[11] R. He, X. Wu, Z. Sun, and T. Tan, “Learning invariant deep representa-tion for NIR-VIS face recognition,” in Proc. Assoc. Advancement Artif.Intell., vol. 4, 2017, pp. 2000–2006.

[12] H. Hotelling, “Relations between two sets of variates,” Biometrika,vol. 28, nos. 3–4, pp. 321–377, 1936.

[13] G. Hu, X. Peng, Y. Yang, T. M. Hospedales, and J. Verbeek, “Franken-stein: Learning deep face representations using small data,” IEEE Trans.Image Process., vol. 27, no. 1, pp. 293–303, Jan. 2018.

[14] Y. Jin, J. Lu, and Q. Ruan, “Coupled discriminative feature learning forheterogeneous face recognition,” IEEE Trans. Inf. Forensics Security,vol. 10, no. 3, pp. 640–652, Mar. 2015.

[15] D. Kang, H. Han, A. K. Jain, and S.-W. Lee, “Nighttime face recognitionat large standoff: Cross-distance and cross-spectral matching,” PatternRecognit., vol. 47, no. 12, pp. 3750–3766, 2014.

[16] V. Kazemi and J. Sullivan, “One millisecond face alignment with anensemble of regression trees,” in Proc. IEEE Conf. Comput. Vis. PatternRecognit., Jun. 2014, pp. 1867–1874.

[17] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, “On combiningclassifiers,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3,pp. 226–239, Mar. 1998.

[18] B. F. Klare and A. K. Jain, “Heterogeneous face recognition using kernelprototype similarities,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35,no. 6, pp. 1410–1422, Jun. 2013.

[19] M. Kostinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof, “Largescale metric learning from equivalence constraints,” in Proc. IEEE Int.Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 2288–2295.

[20] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Deep Laplacianpyramid networks for fast and accurate super-resolution,” in Proc. IEEEConf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 624–632.

[21] Q. Leng, R. Hu, C. Liang, Y. Wang, and J. Chen, “Person re-identification with content and context re-ranking,” Multimedia ToolsAppl., vol. 74, no. 17, pp. 6989–7014, 2015.

[22] J. Lezama, Q. Qiu, and G. Sapiro, “Not afraid of the dark: NIR-VISface recognition via cross-spectral hallucination and low-rank embed-ding,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017,pp. 6807–6816.

[23] S. Li, D. Yi, Z. Lei, and S. Liao, “The CASIA NIR-VIS 2.0 face data-base,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops,Jun. 2013, pp. 348–353.

[24] W. Li, Y. Wu, M. Mukunoki, and M. Minoh, “Common-near-neighboranalysis for person re-identification,” in Proc. IEEE Conf. ImageProcess., Sep. 2012, pp. 1621–1624.

[25] D. Lopez-Paz, S. Sra, A. Smola, Z. Ghahramani, and B. Schölkopf,“Randomized nonlinear component analysis,” in Proc. Int. Conf. Mach.Learn., 2014, pp. 1359–1367.

[26] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,”Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.

[27] J. Lu, V. E. Liong, X. Zhou, and J. Zhou, “Learning compact binaryface descriptor for face recognition,” IEEE Trans. Pattern Anal. Mach.Intell., vol. 37, no. 10, pp. 2041–2056, Oct. 2015.

[28] H. Maeng, H.-C. Choi, U. Park, S.-W. Lee, and A. K. Jain, “NFRAD:Near-infrared face recognition at a distance,” in Proc. IEEE Int. JointConf. Biometrics, Oct. 2011, pp. 1–7.

[29] S. P. Mudunuri and S. Biswas, “Dictionary alignment for low-resolutionand heterogeneous face recognition,” in Proc. IEEE Winter Conf. Appl.Comput. Vis., Mar. 2017, pp. 1115–1123.

[30] S. P. Mudunuri, S. Venkataramanan, and S. Biswas, “Improved lowresolution heterogeneous face recognition using re-ranking,” in Proc.6th Nat. Conf. Comput. Vis., Pattern Recognit., Image Process. Graph.,2018, pp. 446–456.

[31] B.-S. Oh, K. Oh, A. B. J. Teoh, Z. Lin, and K.-A. Toh, “A Gabor-basednetwork for heterogeneous face recognition,” Neurocomputing, vol. 261,pp. 253–265, Oct. 2017.

[32] S. Ouyang, T. Hospedales, Y.-Z. Song, X. Li, C. C. Loy, andX. Wang, “A survey on heterogeneous face recognition: Sketch, infra-red, 3D and low-resolution,” Image Vis. Comput., vol. 56, pp. 28–48,Dec. 2016.

[33] O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” in Proc. Brit. Mach. Vis. Conf., 2015, pp. 1–12.

[34] N. Rasiwasia, D. Mahajan, V. Mahadevan, and G. Aggarwal, “Cluster canonical correlation analysis,” in Proc. Artif. Intell. Stat., 2014, pp. 823–831.

[35] C. Reale, H. Lee, and H. Kwon, “Deep heterogeneous face recognition networks based on cross-modal distillation and an equitable distance metric,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, Jul. 2017, pp. 32–38.

[36] C. Reale, N. M. Nasrabadi, H. Kwon, and R. Chellappa, “Seeing the forest from the trees: A holistic approach to near-infrared heterogeneous face recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2016, pp. 54–62.

[37] H. Roy and D. Bhattacharjee, “Local-gravity-face (LG-face) for illumination-invariant and heterogeneous face recognition,” IEEE Trans. Inf. Forensics Security, vol. 11, no. 7, pp. 1412–1424, Jul. 2016.

[38] M. S. Sarfraz, A. Schumann, A. Eberle, and R. Stiefelhagen. (2017). “A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking.” [Online]. Available: https://arxiv.org/abs/1711.10378

[39] S. Saxena and J. Verbeek, “Heterogeneous face recognition with CNNs,” in Proc. Eur. Conf. Comput. Vis. Workshops, 2016, pp. 1–8.

[40] A. Sharma, A. Kumar, H. Daume, and D. W. Jacobs, “Generalized multiview analysis: A discriminative latent space,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 2160–2167.

[41] M. Singh, S. Nagpal, N. Gupta, S. Ghosh, R. Singh, and M. Vatsa, “Cross-spectral cross-resolution video database for face recognition,” in Proc. IEEE Int. Conf. Biometrics Theory, Appl. Syst., Sep. 2016, pp. 1–7.

[42] T. Bourlai and B. Cukic, “Multi-spectral face recognition: Identification of people in difficult environments,” in Proc. IEEE Int. Joint Conf. Intell. Secur. Inform., Jun. 2012, pp. 196–201.

[43] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Dec. 2001, pp. 1–9.

[44] S. Wang, D. Zhang, Y. Liang, and Q. Pan, “Semi-coupled dictionary learning with applications to image super-resolution and photo-sketch synthesis,” in Proc. IEEE Int. Conf. Comput. Vis., Jun. 2012, pp. 2216–2223.

[45] X. Wu, R. He, Z. Sun, and T. Tan. (2015). “A light CNN for deep face representation with noisy labels.” [Online]. Available: https://arxiv.org/abs/1511.02683

[46] X. Wu, L. Song, R. He, and T. Tan. (2017). “Coupled deep learning for heterogeneous face recognition.” [Online]. Available: https://arxiv.org/abs/1704.02450

[47] M. Ye et al., “Person reidentification via ranking aggregation of similarity pulling and dissimilarity pushing,” IEEE Trans. Multimedia, vol. 18, no. 12, pp. 2553–2566, Dec. 2016.

[48] R. Yu, Z. Zhou, S. Bai, and X. Bai. (2017). “Divide and fuse: A re-ranking approach for person re-identification.” [Online]. Available: https://arxiv.org/abs/1708.04169

[49] S. Zhang, M. Yang, T. Cour, K. Yu, and D. N. Metaxas, “Query specific fusion for image retrieval,” in Proc. Eur. Conf. Comput. Vis. Workshops, 2012, pp. 660–673.

[50] L. Zheng, S. Wang, L. Tian, F. He, Z. Liu, and Q. Tian, “Query-adaptive late fusion for image search and person re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 1741–1750.

[51] L. Zheng, Y. Yang, and A. Hauptmann. (2016). “Person re-identification: Past, present and future.” [Online]. Available: https://arxiv.org/abs/1610.02984

[52] J.-Y. Zhu, W.-S. Zheng, J.-H. Lai, and S. Z. Li, “Matching NIR face to VIS face using transduction,” IEEE Trans. Inf. Forensics Security, vol. 9, no. 3, pp. 501–514, Mar. 2014.

Authors’ photographs and biographies not available at the time of publication.