
MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com

Super-resolution of Very Low-Resolution Faces from Videos

Ataer-Cansizoglu, E.; Jones, M.J.

TR2018-140 September 26, 2018

Abstract

Faces appear in low-resolution video sequences in various domains such as surveillance. The information accumulated over multiple frames can help super-resolution for high magnification factors. We present a method to super-resolve a face image using the consecutive frames of the face in the same sequence. Our method is based on a novel multi-input-single-output framework with a Siamese deep network architecture that fuses multiple frames into a single face image. Contrary to existing work on video super-resolution, it is model free and does not depend on facial landmark detection that might be difficult to handle for very low-resolution faces. The experiments show that the use of multiple frames as input improves the performance compared to single-input-single-output systems.

British Machine Vision Conference (BMVC)

This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories, Inc.; an acknowledgment of the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories, Inc. All rights reserved.

Copyright © Mitsubishi Electric Research Laboratories, Inc., 2018
201 Broadway, Cambridge, Massachusetts 02139


Super-resolution of Very Low-Resolution Faces from Videos

Esra Ataer-Cansizoglu [email protected]

Michael J. Jones [email protected]

Mitsubishi Electric Research Labs (MERL), Cambridge, MA, USA

Abstract

Faces appear in low-resolution video sequences in various domains such as surveillance. The information accumulated over multiple frames can help super-resolution for high magnification factors. We present a method to super-resolve a face image using the consecutive frames of the face in the same sequence. Our method is based on a novel multi-input-single-output framework with a Siamese deep network architecture that fuses multiple frames into a single face image. Contrary to existing work on video super-resolution, it is model free and does not depend on facial landmark detection that might be difficult to handle for very low-resolution faces. The experiments show that the use of multiple frames as input improves the performance compared to single-input-single-output systems.

1 Introduction

Face images appear in various platforms and are vital for many applications ranging from forensics to health monitoring. In most cases, these images are low-resolution, making face identification difficult. Therefore, upsampling low-resolution face images is crucial. In this paper, we present a method to generate a super-resolved face image from a given low-resolution (LR) face video sequence.

The use of multiple frames for super-resolution is well studied. However, previous work mostly focuses on images of general scenes, and the magnification factor does not go beyond 4×. Moreover, most of these algorithms rely on motion estimation between frames, which is hard when the target object in the image is very low-resolution. There are a few studies addressing the problem of face super-resolution using videos [8, 9, 28, 29]. These methods are either model-based [8, 9] or registration-based [28, 29]. Therefore, the magnification factor is small, since the model fitting and registration stages highly depend on landmark detection or texture, respectively. In order to address these limitations, we present a model-free multi-frame face super-resolution technique designed for very low-resolution face images. The LR faces we target in this work are tiny, with a facial area of around 6×6 pixels, and are difficult to identify even by humans. Our method uses a deep network architecture consisting of two subnetworks. The first subnetwork is trained to super-resolve each frame independently, while the second subnetwork generates weights to fuse super-resolved images of all frames.

© 2018. The copyright of this document resides with its authors. It may be distributed unchanged freely in print or electronic forms.


Thus, the fusion network uses the information accumulated over multiple images in order to generate a single super-resolved image.

Our main contributions are threefold: (i) We present a novel method for multi-frame super-resolution of very tiny face images with a magnification factor of 8 times. (ii) The presented method is model free and requires only rough alignment instead of the pixelwise registration that many previous multi-frame super-resolution algorithms depend on. (iii) Our technique involves a novel multi-input-single-output (MISO) framework with a Siamese deep network architecture that fuses multiple frames into a single face image. Instead of relying on a prior motion estimation stage, which is challenging for very LR face images, our fusion network implicitly uses the motion among frames in order to generate the final image.

1.1 Related Work

The use of multiple frames to super-resolve a single image is well studied in the literature [18]. There is a wide range of techniques, including iterative [6, 10], direct [5, 21] and probabilistic [1, 20, 32] approaches for performing multi-input-single-output (MISO) super-resolution. The early methods [6, 10] focused on iterative back projection in order to estimate the forward imaging model given low-resolution sequences. Most of these methods deal with frames with known transformations between them, making them hard to apply to real scenarios. In the category of direct methods, registration among frames is estimated and utilized for producing a single output [5, 21]. Since registration plays an important role in the super-resolution process, these methods are not capable of handling high magnification factors, especially if the target object appears very small in the input image. Probabilistic approaches [1, 20, 32] utilize different regularization terms depending on the sufficiency of the given low-resolution images and the magnification factor. Designing the noise model and tuning regularization parameters are challenging due to the fact that the problem is ill-posed.

With the rise of deep learning, there have recently been attempts to improve super-resolution using neural networks. Some methods consider the registration among frames. Liao et al. [14] presented a method that first generates a super-resolution draft ensemble using an existing motion estimation method and then reconstructs the final result through a CNN deconvolving the draft ensemble. Tao et al. [22] introduced a subpixel motion compensation layer in order to handle inter-frame motion. Ma et al. [17] presented a motion de-blurring method based on an expectation-maximization framework for estimating blur and reconstruction parameters. Kappeler et al. [12] experimented with various CNN architectures for video super-resolution by incorporating neighboring frame alignment. All of these methods are for general video SR, and they only aim for at most 4× magnification.

There has been a tremendous amount of work on single-image super-resolution of faces [2, 4, 13, 15, 16, 24]. Recently, a few papers have used convolutional neural networks (CNNs) or generative adversarial networks (GANs) for face-specific super-resolution. These methods better handle uncontrolled input faces with variations in lighting, pose and expression and only rough alignment. Zhou et al. [34]'s bi-channel approach used a CNN to extract features from the low-resolution input face and then mapped these, using fully connected network layers, to an intermediate upsampled face image, which is linearly combined with a bicubically interpolated upsampling of the input face image to create a final high-resolution output face. In other work, Yu and Porikli [30, 31] used GANs for 8 times super-resolution of face images. Cao et al. [3] presented a deep reinforcement learning approach that sequentially discovers attended patches followed by facial part enhancement. Zhu et al. [35] proposed a bi-network architecture to solve super-resolution in a cascaded way. Each of these methods


uses a single image of the person to produce the final super-resolved image. Therefore, they do not make use of the information coming from multiple frames.

Multiple frames for face super-resolution were used in early studies in the field. However, the datasets were generated with simple translational motion with alignment, and the faces were captured in a controlled environment. Baker et al. [1] followed a Bayesian formulation of the problem and presented a solution based on reconstruction and recognition losses. Their technique makes use of multiple frames that are generated by simple shifting transformations. Recently, Huber et al. [8, 9] introduced a video-based super-resolution algorithm that makes use of the Surrey face model. The technique fuses intensities coming from all frames in a 2D isomap representation of the face model. It highly depends on landmark detection and thus cannot be applied to very low-resolution faces such as the ones used in this study. Two other related studies [28, 29] focus on the problem of face registration for MISO face super-resolution. Yu et al. [29] presented a method that performs global transformation and local deformation for 3× upsampling of faces with expression changes. Similarly, Yoshida et al. [28] employed a free-form deformation method to warp all frames into a reference frame, followed by fusion of the warped images. In both studies, the facial area in the input low-resolution images is large (i.e., greater than 30×20 pixels), which enables an accurate registration.

2 Method

2.1 Problem Statement and Notation

Given a sequence of $N$ low-resolution $W \times H$ face images, $S_L = \{I_L^1, I_L^2, \cdots, I_L^N\}$, our goal is to predict a corresponding high-resolution (HR) face image $I$ of size $dW \times dH$, where $d \in \mathbb{N}$ is a non-zero magnification factor. Each frame in the sequence is an LR version of the face in a different pose and expression. Thus

$$I_L^k = g(f_k(I)) = D H f_k(I) \tag{1}$$

where $f_k : \mathbb{R}^{dW \times dH} \to \mathbb{R}^{dW \times dH}$ denotes a deformation function that takes a face and transforms it to another pose and expression, and $g : \mathbb{R}^{dW \times dH} \to \mathbb{R}^{W \times H}$ is the function representing the LR image capturing process, which is composed of a blurring matrix $H$ that represents the point spread function and a downsampling matrix $D$.
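To make the degradation model of Eq. (1) concrete, here is a minimal Python sketch of $g = DH$, assuming a Gaussian point-spread function; the paper generates its LR data following the protocol of [27], so the kernel choice and `sigma` here are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(hr_image, d=8, sigma=1.6):
    """Apply g = DH to an HR face image: blur with a point-spread
    function H, then downsample by the magnification factor d (the
    matrix D in Eq. (1)). The Gaussian PSF and sigma are assumptions."""
    blurred = gaussian_filter(hr_image.astype(np.float64), sigma=sigma)  # H
    return blurred[::d, ::d]                                             # D

# A 128x128 HR face becomes a 16x16 LR frame for d = 8, so a ~50x50
# facial region shrinks to roughly 6x6 pixels, as in the paper.
lr = degrade(np.random.rand(128, 128), d=8)
assert lr.shape == (16, 16)
```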

2.2 Framework

During training, we are also provided a corresponding HR image sequence $S_H = \{I_H^1, I_H^2, \cdots, I_H^N\}$. We are interested in finding the inverses of both $f_k$ and $g$ in order to obtain an estimate for $I$. Considering the sequential nature of the process, we follow a two-step approach. In the first step, we learn a generator to estimate the HR version $I_H^k$ of each image in the sequence. Second, we estimate $I$ from the super-resolved images by learning a fusion function that represents $f_1^{-1}, f_2^{-1}, \cdots, f_N^{-1}$.

Let $G : \mathbb{R}^{W \times H} \to \mathbb{R}^{dW \times dH}$ be a generator function that creates super-resolved images given LR images. We learn $G$ by minimizing the following cost function:

$$\mathcal{L}_{gen} = \frac{1}{TN} \sum_{\langle S_L, S_H \rangle} \sum_{i=1}^{N} D(G(I_L^i), I_H^i) \tag{2}$$


where $T$ is the number of sequences in the training set and $D$ is a distance measure between two images.

The fusion function is parametrized by a sequence of weight maps, which produces a weighted sum of all the super-resolved images

$$F(S_L, G) = \sum_{i=1}^{N} F_i(I_L^1, \cdots, I_L^N, G(I_L^1), \cdots, G(I_L^N)) \otimes G(I_L^i) \tag{3}$$

where $F_i : (N \times \mathbb{R}^{W \times H}) + (N \times \mathbb{R}^{dW \times dH}) \to \mathbb{R}^{dW \times dH}$ denotes a function that returns pixel weights given a sequence of LR and corresponding super-resolved images, and $\otimes$ indicates element-wise multiplication between matrices. $F$ is learned by minimizing the fusion cost

$$\mathcal{L}_{fus} = \frac{1}{T} \sum_{\langle S_L, S_H \rangle} D(F(S_L, G), I) \tag{4}$$

where $I$ is taken as the HR version of the $K$th frame of the sequence.
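Eq. (3) is just an element-wise weighted sum of the per-frame super-resolved images. Below is a minimal NumPy sketch, with placeholder weight maps standing in for the fusion network's output.

```python
import numpy as np

def fuse(weight_maps, sr_frames):
    """Eq. (3): F(S_L, G) = sum_i F_i (x) G(I_L^i), where (x) is
    element-wise multiplication. Both arguments are lists of N arrays
    of shape (dW, dH); in the paper the weight maps come from the
    fusion network, here they are placeholders."""
    out = np.zeros_like(sr_frames[0])
    for w, sr in zip(weight_maps, sr_frames):
        out += w * sr
    return out

# Toy usage: uniform weights over N = 8 frames (illustrative only).
N, dW, dH = 8, 128, 128
frames = [np.random.rand(dW, dH) for _ in range(N)]
weights = [np.full((dW, dH), 1.0 / N) for _ in range(N)]
fused = fuse(weights, frames)
```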

We employ two types of distance measures in training: L2 distance and structural similarity (SSIM) distance. The L2 distance between two images $X, Y \in \mathbb{R}^{W \times H}$ is defined as

$$D_{L2}(X, Y) = \|X - Y\|_F \tag{5}$$

where $\|\cdot\|_F$ indicates the Frobenius norm. SSIM is a popular measure that accounts for human perception of image quality [33]. On a local patch of two images $X$ and $Y$ around pixel $p$, SSIM is computed as

$$\mathrm{SSIM}(X, Y, p) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$

where $\mu_x, \mu_y$ and $\sigma_x, \sigma_y$ denote the mean and variance, respectively, of the intensities in the local patch around pixel $p$ in $X$ and $Y$, and $\sigma_{xy}$ is the covariance of intensities in the two local patches. $c_1$ and $c_2$ are constant factors that stabilize the division with a weak denominator. We divide the image into $h \times h$ grids and employ the mean of SSIM over all patches to come up with the SSIM distance between two images

$$D_{ssim}(X, Y) = \frac{1}{T} \sum_{k=1}^{T} \sum_{p} \left[ 1 - \mathrm{SSIM}(X_k, Y_k, p) \right] \tag{6}$$

where $T$ is the total number of patches and $X_k$ and $Y_k$ are the $k$th corresponding patch pair. Compared with the reconstruction loss, minimizing the SSIM loss helps visually recover high-frequency information. Consequently, we utilize a weighted sum of the two distance measures in training our algorithm

$$D(X, Y) = D_{L2}(X, Y) + \alpha D_{ssim}(X, Y). \tag{7}$$
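A minimal NumPy sketch of Eqs. (5)-(7) follows. The non-overlapping $h \times h$ patch grid is one reading of Eq. (6), and the SSIM constants `c1`, `c2` are common defaults [33] rather than values stated in the paper; $h = 8$ and $\alpha = 1000$ are the settings from Section 3.1.

```python
import numpy as np

def l2_distance(x, y):
    """Eq. (5): Frobenius norm of the difference image."""
    return np.linalg.norm(x - y)

def ssim_patch(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM of one local patch pair; c1, c2 are assumed defaults."""
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (x.var() + y.var() + c2))

def ssim_distance(x, y, h=8):
    """Eq. (6): mean of 1 - SSIM over an h x h patch grid
    (non-overlapping patches assumed)."""
    terms = [1.0 - ssim_patch(x[i:i + h, j:j + h], y[i:i + h, j:j + h])
             for i in range(0, x.shape[0] - h + 1, h)
             for j in range(0, x.shape[1] - h + 1, h)]
    return float(np.mean(terms))

def joint_distance(x, y, alpha=1000.0):
    """Eq. (7): D = D_L2 + alpha * D_ssim (alpha = 1000 in Sec. 3.1)."""
    return l2_distance(x, y) + alpha * ssim_distance(x, y)
```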

2.3 Network Architecture

An overview of our system is shown in Figure 1. Each LR image in the input sequence is first given to the generator network to obtain an estimate of its HR version. The LR images are also upsampled to the HR image size with bicubic interpolation¹ and input to the fusion network along with the super-resolved images. The fusion network generates weight maps based on the temporal information in the sequence.

¹ Any other upsampling technique can be used here.

Figure 1: System overview. Dashed lines indicate weight sharing between subnetworks. (The LR frames $I_L^1, \ldots, I_L^N$ of $S_L$ each pass through a weight-shared Generator; the outputs $G(I_L^1), \ldots, G(I_L^N)$, together with bicubic upsamplings of the inputs, feed the Fusion network, which produces the weight maps $F_1, \ldots, F_N$ and the fused result $F(S_L, G)$.)

Network    Layer Index (Depth)   Type     Kernel Size   Stride   Pad   Output Channels   Learning Rate
Generator  1                     Conv     3             1        1     512               10^-4
Generator  2                     Deconv   2             2        0     512               0
Generator  3                     Conv     3             1        1     256               10^-4
Generator  4                     Deconv   2             2        0     256               0
Generator  5                     Conv     5             1        2     128               10^-4
Generator  6                     Deconv   2             2        0     128               0
Generator  7                     Conv     5             1        2     64                10^-4
Generator  8                     Conv     5             1        2     1                 10^-4
Fusion     1-5                   Conv     3             1        1     64                10^-5
Fusion     6                     Conv     3             1        1     1                 10^-6

Table 1: Network architecture

For the generator we use a slightly modified version of [30] trained on gray-scale images. For the fusion network we use an architecture similar to the gated network of [35]. The details of our architecture are given in Table 1. Each convolution layer is followed by a ReLU except the last layer of each subnetwork. The deconvolution layers perform a simple nearest-neighbor upsampling of the data, and their weights are kept fixed during training.
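To make Table 1 concrete, below is a sketch of the two subnetworks in PyTorch (the paper's implementation is in Caffe [11]). The fixed deconvolution layers are modeled with nearest-neighbor `nn.Upsample`; the per-layer learning rates are omitted, and the fusion branch's input channel count (two channels per frame here) is an assumption, since the paper does not state it.

```python
import torch
import torch.nn as nn

# Generator per Table 1: learnable convs interleaved with fixed 2x
# nearest-neighbor upsampling ("Deconv" rows), giving 8x magnification.
generator = nn.Sequential(
    nn.Conv2d(1, 512, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2, mode='nearest'),   # fixed deconv, layer 2
    nn.Conv2d(512, 256, kernel_size=3, stride=1, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2, mode='nearest'),   # layer 4
    nn.Conv2d(256, 128, kernel_size=5, stride=1, padding=2), nn.ReLU(),
    nn.Upsample(scale_factor=2, mode='nearest'),   # layer 6
    nn.Conv2d(128, 64, kernel_size=5, stride=1, padding=2), nn.ReLU(),
    nn.Conv2d(64, 1, kernel_size=5, stride=1, padding=2),  # no ReLU on last
)

# One weight-shared fusion branch: five 64-channel 3x3 convs plus a
# 1-channel output conv. Feeding each branch the bicubic-upsampled LR
# frame stacked with its SR estimate (2 input channels) is an assumption.
fusion_branch = nn.Sequential(
    nn.Conv2d(2, 64, 3, 1, 1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, 1, 1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, 1, 1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, 1, 1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, 1, 1), nn.ReLU(),
    nn.Conv2d(64, 1, 3, 1, 1),
)

x = torch.randn(1, 1, 16, 16)                    # one 16x16 LR frame
assert generator(x).shape == (1, 1, 128, 128)    # 8x magnification
```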

3 Experiments

We carried out experiments in controlled and uncontrolled environments on three datasets. We first performed experiments on the FRGC dataset [19], which consists of frontal face images captured under a controlled environment. We generated a video sequence for each image in the dataset by simulating a similarity transform. In order to analyze the performance of the method in real-life scenarios, we ran experiments on the ChokePoint [26] and YouTube Faces [25] datasets, which contain videos of faces in the wild.


3.1 Dataset and Experimental Setup

FRGC: The FRGC dataset contains frontal face images taken in a studio setting under two lighting conditions with only two facial expressions (smiling and neutral). We generated training and test splits, keeping the identities in each set disjoint. The training set consisted of 20,000 images from 409 subjects and the test set consisted of 2,149 images from 142 subjects. In order to generate a sequence of face images, for each image in the training or testing set we randomly generated different similarity transforms and applied them to the HR image, followed by LR image generation.

ChokePoint: This dataset consists of face videos of subjects captured while walking through a portal in a natural way. We used the fourth sequence of the images in the P1E, P1L and P2E portals as the test set and the rest as the training set, achieving a diverse set of training and test images for enter-exit and indoor-outdoor scenarios. We discarded frames whose facial region was smaller than our experimental HR setting. As a result, we ended up with 8,272 training image sequences and 1,884 test sequences.

YouTube Faces: This dataset contains 3,245 videos of 1,595 people downloaded from YouTube. It comes with 10 folds with disjoint identities among them. We used the first fold as the test set and the rest as the training set. As a result, we had 46,693 test image sequences and 402,003 training image sequences.

Data preparation: HR images had 128×128 pixel resolution, where the faces occupied approximately 50×50 pixels. We generated low-resolution images with 8 times downsampling by following the approach in [27]. As a result, the LR images contained a facial area of approximately 6×6 pixels. We used N = 8 consecutive frames in forming our MISO datasets. The images in each sequence were roughly aligned with a similarity transform based on the locations of the eye and mouth centers, as sketched below. The ChokePoint dataset already provides facial landmark points as ground truth. For the FRGC and YouTube Faces datasets, we employed the landmark detection method in [23]. In practice, a low-resolution face detector would be sufficient for rough alignment. We set K = 8 and super-resolved the last frame in each sequence.

Algorithm Details and Experimental Setup: We trained two different networks depending on the distance measure used. In other words, we either used only the L2 distance by setting α = 0 or used both L2 and SSIM distances as a joint distance measure. The value of α was selected with a greedy search procedure as α = 1000. We super-resolved test image sequences using our multi-input-single-output (MISO) super-resolution framework. Since the first part of our network architecture (i.e., the generator network) is single-input-single-output (SISO), we also performed experiments by inputting the test images one by one to the generator network. Our goal is to better understand the gain from using video inputs as opposed to single images.
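As referenced above, rough alignment can be done with a similarity transform estimated from three landmark points. A sketch using scikit-image follows; the canonical template coordinates are hypothetical, and the paper does not specify its alignment implementation.

```python
import numpy as np
from skimage.transform import SimilarityTransform, warp

# Hypothetical canonical positions (x, y) of the left eye, right eye
# and mouth center in a 128x128 aligned face; not from the paper.
TEMPLATE = np.array([(44.0, 52.0), (84.0, 52.0), (64.0, 92.0)])

def align_face(img, landmarks, out_shape=(128, 128)):
    """Roughly align a face using a similarity transform that maps the
    canonical template points onto the detected eye/mouth landmarks
    (a 3x2 array of (x, y) points)."""
    tform = SimilarityTransform()
    tform.estimate(TEMPLATE, np.asarray(landmarks))  # template -> image
    # warp() expects a map from output (template) coordinates to input
    # (image) coordinates, which is exactly the estimated transform.
    return warp(img, tform, output_shape=out_shape)
```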

The code is implemented using the Caffe deep learning framework [11]. Optimization was performed using the RMSProp algorithm [7]. Training took 3 days on an NVIDIA GeForce GTX TITAN X GPU with the Maxwell architecture and 12 GB of memory. The average running time for super-resolution of a sequence is 0.12 seconds. We used a patch size of h = 8 for the SSIM loss.

3.2 Quantitative Results

We evaluated the performance of our algorithm using the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) measures, as reported in Table 2. As can be seen, using only the L2 distance can yield larger PSNR values, but the use of the SSIM loss improves SSIM scores, as expected.


Method                       FRGC              ChokePoint        YouTube Faces
                             PSNR    SSIM      PSNR    SSIM      PSNR    SSIM
Bicubic                      23.812  0.606     26.034  0.660     23.591  0.598
VSRnet [12]                  22.997  0.565     26.778  0.693     23.907  0.620
SPMC [22]                    21.388  0.555     23.083  0.648     21.189  0.539
SISO (only L2 distance)      26.810  0.808     27.353  0.746     24.676  0.682
SISO (joint distance)        26.767  0.809     27.494  0.751     24.791  0.687
MISO (only L2 distance)      26.892  0.811     27.663  0.761     25.088  0.700
MISO (joint distance)        26.903  0.813     27.726  0.764     25.108  0.701

Table 2: PSNR and SSIM values for the experimental results on all datasets.

The improvement from the SISO architecture to the MISO architecture is most apparent on the ChokePoint dataset. This might be due to the fact that ChokePoint contains subjects walking along a certain trajectory, yielding consistent motion patterns in the sequences. Hence, the fusion network benefits from the motion pattern to generate more accurate super-resolution results.
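For reference, the two measures in Table 2 can be computed per image pair as below; using scikit-image here is an assumption, since the paper does not name its evaluation implementation, and note that the paper evaluates on a bounding box around the face (Section 3.3).

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(sr, gt):
    """PSNR and SSIM for one super-resolved/ground-truth pair, with
    intensities assumed to lie in [0, 1]."""
    psnr = peak_signal_noise_ratio(gt, sr, data_range=1.0)
    ssim = structural_similarity(gt, sr, data_range=1.0)
    return psnr, ssim
```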

We also compared our algorithm with two recent video super-resolution methods: VSRnet by Kappeler et al. [12] and SPMC by Tao et al. [22]. Both methods are designed for general scenes with at most 4× magnification. Thus, we applied the methods with 4× upsampling followed by 2× upsampling. We used the models provided by the authors to test on the face datasets. The sequence size is 5 and 3 for VSRnet and SPMC, respectively. Thus, the final single image is produced using 9 frames for VSRnet and 5 frames for SPMC. Since these methods depend heavily on a preprocessing stage for motion estimation, they suffer from the small facial region seen in the LR images. On the contrary, our algorithm is more adaptable, as the motion information is learned implicitly by the fusion network. As can be seen from Table 2, our method outperforms VSRnet and SPMC based on the PSNR and SSIM measures.

3.3 Qualitative Results

We provide example results on all datasets using our SISO and MISO architectures with different loss functions in Figure 2². Each row from top to bottom shows results for the FRGC, ChokePoint and YouTube Faces datasets, respectively. As can be seen, the use of multiple frames creates more visually appealing results. Also, the networks trained to minimize the SSIM loss along with the reconstruction loss produce sharper-looking images.

Comparative results on the FRGC, ChokePoint and YouTube Faces datasets are displayed in Figures 3, 4 and 5, respectively. Note that the horizontal lines in the VSRnet results are due to the preprocessing stage of the algorithm, which uses patches³. Since we computed our evaluation measures on a bounding box around the face in all our experiments, the spurious lines were not included in the evaluation.

² Since our network is trained on gray-scale images, color images are obtained by super-resolving the luminance channel and adding bicubic-upsampled chrominance channels to the result.

³ The displayed images are cropped from left and right for better visualization. Vertical black lines also appear in the cropped-out areas.


Figure 2: Qualitative results using MISO and SISO architectures with different loss functions (columns: Input, Bicubic, SISO L2, SISO Joint, MISO L2, MISO Joint, GT). Each row from top to bottom contains results from the FRGC, ChokePoint and YouTube Faces datasets, respectively.

Figure 3: Qualitative results on the FRGC dataset (columns: Input, Bicubic, VSRnet [12], SPMC [22], MISO L2, MISO Joint, GT).

Figure 4: Qualitative results on the ChokePoint dataset (columns: Input, Bicubic, VSRnet [12], SPMC [22], MISO L2, MISO Joint, GT).


Figure 5: Qualitative results on the YouTube Faces dataset (columns: Input, Bicubic, VSRnet [12], SPMC [22], MISO L2, MISO Joint, GT).

4 Conclusion

We presented a method for super-resolving face images by using multiple images as input. Motivated by the availability of cameras and recording media, we followed a multi-input-single-output approach. The LR faces we target in this work are tiny, with a facial area of around 6×6 pixels, and are difficult to identify even by the human eye. Our experiments show that the multi-input-single-output framework creates more accurate images than a single-input-single-output framework.

Our network architecture consisted of two subnetworks. The first subnetwork super-resolved each frame in the sequence. This was followed by a fusion network, which used the information accumulated over the frames and produced the super-resolution of one of the frames in the sequence. We used a super-resolution network architecture that consisted of 5 layers. However, our architecture is modular, and the super-resolution network can be replaced with any existing super-resolution network from the literature. Thus, our fusion network idea is a general one that can be utilized with any other existing SR technique.

In our experiments, we used a sequence size of 8, which was decided empirically. The results degraded as we increased the sequence size. This might be due to the fact that our network architecture cannot handle large motion variation, since we kept the kernel sizes small in order to perform SR and fusion on small patches. Thus, the magnification factor, frame rate and motion variation of the videos are factors that would affect the choice of sequence size. Although increasing the sequence size would provide more information for SR of a single image, it requires a more careful and sophisticated use of the data.

An important strength of the method is its independence from face models and motion estimation. Estimating motion or detecting facial landmarks becomes more difficult as the target face size gets smaller. An interesting extension of the current algorithm would be performing SR in a cascaded way, carrying out motion estimation or model fitting once the facial details become visible at some cascade level. However, training a cascaded system (e.g., 2× upsampling in each cascade) would be more time consuming, and the accuracy of the results in each cascade would be more critical, as later stages depend on it.

This work assumes that faces were tracked throughout the sequence. In real-life cases, the cameras are mounted at a specific location (e.g., surveillance cameras at an airport) and other body parts of the person are also visible in the scene. Therefore, tracking faces can benefit from human tracking. Moreover, in some surveillance settings there are certain trajectories that people follow, which can also help improve tracking performance as a motion prior. Repeated motion patterns in the sequences will also be useful for our fusion network to provide a more precise output. Solving the tracking and super-resolution problems in alternating steps will be the future extension of this study.

Acknowledgements

We thank Alan Sullivan, Ziming Zhang, Suzuki Daisuke and Toyoda Yoshitaka for insightful discussions.

References

[1] Simon Baker and Takeo Kanade. Limits on super-resolution and how to break them. IEEE Trans. Pattern Anal. Mach. Intell., 24(9):1167–1183, 2002.

[2] Simon Baker and Takeo Kanade. Limits on super-resolution and how to break them. IEEE Trans. Pattern Anal. Mach. Intell., 24(9):1167–1183, 2002.

[3] Qingxing Cao, Liang Lin, Yukai Shi, Xiaodan Liang, and Guanbin Li. Attention-aware face hallucination via deep reinforcement learning. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2017.

[4] David Capel and Andrew Zisserman. Super-resolution from multiple views using learnt image models. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2001.

[5] Sina Farsiu, M Dirk Robinson, Michael Elad, and Peyman Milanfar. Fast and robust multiframe super resolution. IEEE Transactions on Image Processing, 13(10):1327–1344, 2004.

[6] Sina Farsiu, Michael Elad, and Peyman Milanfar. Multiframe demosaicing and super-resolution of color images. IEEE Transactions on Image Processing, 15(1):141–159, 2006.

[7] G Hinton, N Srivastava, and K Swersky. RMSProp: Divide the gradient by a running average of its recent magnitude. Neural networks for machine learning, Coursera lecture 6e, 2012.

[8] Patrik Huber, William Christmas, Adrian Hilton, Josef Kittler, and Matthias Rätsch. Real-time 3D face super-resolution from monocular in-the-wild videos. In Proc. SIGGRAPH, page 67. ACM, 2016.

[9] Patrik Huber, Philipp Kopp, William Christmas, Matthias Rätsch, and Josef Kittler. Real-time 3D face fitting and texture fusion on in-the-wild videos. IEEE Signal Processing Letters, 24(4):437–441, 2017.

[10] Michal Irani and Shmuel Peleg. Image sequence enhancement using multiple motions analysis. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 216–221. IEEE, 1992.

[11] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[12] Armin Kappeler, Seunghwan Yoo, Qiqin Dai, and Aggelos K Katsaggelos. Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging, 2(2):109–122, 2016.

[13] Yang Li and Xueyin Lin. Face hallucination with pose variation. In Inter. Conf. on Automatic Face and Gesture Recognition, 2004.

[14] Renjie Liao, Xin Tao, Ruiyu Li, Ziyang Ma, and Jiaya Jia. Video super-resolution via deep draft-ensemble learning. In Proc. IEEE Int'l Conf. Computer Vision (ICCV), pages 531–539, 2015.

[15] Ce Liu, Harry Y. Shum, and Changshui Zhang. A two-step approach to hallucinating faces: global parametric model and local nonparametric model. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2001.

[16] Ce Liu, Heung-Yeung Shum, and William T. Freeman. Face hallucination: Theory and practice. Int'l J. Computer Vision, 75(1):115–134, 2007.

[17] Ziyang Ma, Renjie Liao, Xin Tao, Li Xu, Jiaya Jia, and Enhua Wu. Handling motion blur in multi-frame super-resolution. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 5224–5232. IEEE, 2015.

[18] Kamal Nasrollahi and Thomas B Moeslund. Super-resolution: a comprehensive survey. Machine Vision and Applications, 25(6):1423–1468, 2014.

[19] P Jonathon Phillips, Patrick J Flynn, Todd Scruggs, Kevin W Bowyer, Jin Chang, Kevin Hoffman, Joe Marques, Jaesik Min, and William Worek. Overview of the face recognition grand challenge. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), volume 1, pages 947–954, 2005.

[20] Lyndsey C Pickup, David P Capel, Stephen J Roberts, and Andrew Zisserman. Bayesian image super-resolution, continued. In Proc. Neural Information Processing Systems (NIPS), pages 1089–1096, 2007.

[21] Matan Protter and Michael Elad. Super resolution with probabilistic motion estimation. IEEE Transactions on Image Processing, 18(8):1899–1904, 2009.

[22] Xin Tao, Hongyun Gao, Renjie Liao, Jue Wang, and Jiaya Jia. Detail-revealing deep video super-resolution. In Proc. IEEE Int'l Conf. Computer Vision (ICCV), 2017.

[23] Oncel Tuzel, Tim K Marks, and Salil Tambe. Robust face alignment using a mixture of invariant experts. In Proc. European Conf. Computer Vision (ECCV), pages 825–841, 2016.

[24] Nannan Wang, Dacheng Tao, Xinbo Gao, Xuelong Li, and Jie Li. A comprehensive survey to face hallucination. Int'l J. Computer Vision, 106(1):9–30, 2014.

[25] Lior Wolf, Tal Hassner, and Itay Maoz. Face recognition in unconstrained videos with matched background similarity. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 529–534, 2011.

[26] Yongkang Wong, Shaokang Chen, Sandra Mau, Conrad Sanderson, and Brian C. Lovell. Patch-based probabilistic image quality assessment for face selection and improved video-based face recognition. In IEEE Biometrics Workshop, Computer Vision and Pattern Recognition (CVPR) Workshops, pages 81–88. IEEE, June 2011.

[27] Chih-Yuan Yang, Chao Ma, and Ming-Hsuan Yang. Single-image super-resolution: A benchmark. In Proc. European Conf. Computer Vision (ECCV), pages 372–386, 2014.

[28] Tomonari Yoshida, Tomokazu Takahashi, Daisuke Deguchi, Ichiro Ide, and Hiroshi Murase. Robust face super-resolution using free-form deformations for low-quality surveillance video. In IEEE International Conference on Multimedia and Expo (ICME), pages 368–373. IEEE, 2012.

[29] Jiangang Yu and Bir Bhanu. Super-resolution of facial images in video with expression changes. In IEEE International Conference on Advanced Video and Signal Based Surveillance, pages 184–191. IEEE, 2008.

[30] Xin Yu and Fatih Porikli. Ultra-resolving face images by discriminative generative networks. In Proc. European Conf. Computer Vision (ECCV), pages 318–333, 2016.

[31] Xin Yu and Fatih Porikli. Hallucinating very low-resolution unaligned and noisy face images by transformative discriminative autoencoders. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 3760–3768, 2017.

[32] Qiangqiang Yuan, Liangpei Zhang, Huanfeng Shen, and Pingxiang Li. Adaptive multiple-frame image super-resolution based on U-curve. IEEE Transactions on Image Processing, 19(12):3157–3170, 2010.

[33] Hang Zhao, Orazio Gallo, Iuri Frosio, and Jan Kautz. Loss functions for image restoration with neural networks. IEEE Transactions on Computational Imaging, 3(1):47–57, 2017.

[34] Erjin Zhou, Haoqiang Fan, Zhimin Cao, Yuning Jiang, and Qi Yin. Learning face hallucination in the wild. In Proc. of the AAAI Conf. on Artificial Intelligence, 2015.

[35] Shizhan Zhu, Sifei Liu, Chen Change Loy, and Xiaoou Tang. Deep cascaded bi-network for face hallucination. In Proc. European Conf. Computer Vision (ECCV), pages 614–630. Springer, 2016.