Unconstrained Face Recognition: Deep Learning Approaches
Chun-Ting Huang
2016/7/22, USC Multimedia Communication Lab
http://www.nytimes.com/2015/08/13/us/facial-recognition-software-moves-from-overseas-wars-to-local-police.html?_r=0
Why Face?
▪ Facial features scored the highest compatibility in a Machine Readable Travel Documents (MRTD) system
Hietmeyer, R.: Biometric identification promises fast and secure processing of airline passengers. ICAO J. 55(9), 10–11 (2000)
Outline
▪ Introduction
▪ Unconstrained face dataset
▪ Unconstrained face recognition with deep learning
▪ Papers from industry
▪ Papers from academia
▪ Discussion and conclusion
Introduction
Categorization
▪ A face recognition system operates in two modes
▪ Face verification (authentication)
▪ Face identification (recognition)
▪ Face verification
▪ One-to-one match
▪ Compares a query face image against an enrollment face image
▪ Face identification
▪ One-to-many match
▪ Compares a query face against multiple faces in the enrollment database
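A minimal sketch of the two modes, assuming each face has already been converted into a fixed-length feature vector by some descriptor. The cosine-similarity score, the `threshold` values, and the toy `gallery` vectors are illustrative assumptions, not part of any particular system. Returning `None` when the best gallery score is too low models the open-set case, where a probe may belong to no enrolled subject.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify(query, enrolled, threshold):
    """Verification (one-to-one): accept if the two faces are similar enough."""
    return cosine_similarity(query, enrolled) >= threshold


def identify(query, gallery, threshold):
    """Identification (one-to-many) against a gallery {identity: feature}.

    Returns the best-matching identity, or None if even the best score is
    below the threshold (the query is treated as a non-gallery subject).
    """
    best_id, best_score = None, -math.inf
    for identity, feature in gallery.items():
        score = cosine_similarity(query, feature)
        if score > best_score:
            best_id, best_score = identity, score
    return best_id if best_score >= threshold else None


# Hypothetical 3-dimensional features, for illustration only.
gallery = {"alice": [1.0, 0.0, 0.2], "bob": [0.0, 1.0, 0.1]}
print(verify([0.9, 0.1, 0.2], gallery["alice"], threshold=0.8))  # True
print(identify([0.1, 0.95, 0.1], gallery, threshold=0.8))        # bob
print(identify([0.0, 0.0, 1.0], gallery, threshold=0.8))         # None
```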
Face Recognition Processing Flow
Jain, Anil K., and Stan Z. Li. Handbook of Face Recognition. Vol. 1. New York: Springer, 2011
Face Subspace
Jain, Anil K., and Stan Z. Li. Handbook of Face Recognition. Vol. 1. New York: Springer, 2011
Frontal Face Recognition
Conventional Approaches
▪ Template matching
▪ PCA: M. Turk and A. Pentland, "Eigenfaces for Recognition", Journal of Cognitive Neuroscience, Vol. 3, No. 1, Winter 1991
▪ LDA: K. Etemad and R. Chellappa, "Discriminant analysis for recognition of human face images", JOSA A, 1997
▪ HOG: N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection", IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), Vol. 1, IEEE, 2005
▪ LBP: T. Ahonen, A. Hadid, and M. Pietikäinen, "Face description with local binary patterns: Application to face recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006
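As an illustration of one of these hand-crafted descriptors, the sketch below computes the basic 3 × 3 LBP code and a histogram of codes over an image stored as nested lists. The clockwise bit order starting at the top-left neighbour is one common convention (an assumption here); Ahonen et al. additionally divide the face into regions and concatenate the per-region histograms, which this sketch omits.

```python
def lbp_code(patch):
    """8-neighbour LBP code for the centre pixel of a 3x3 patch.

    Each neighbour that is >= the centre contributes a 1 bit; bits are
    assigned clockwise starting at the top-left neighbour (a convention).
    """
    center = patch[1][1]
    coords = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(coords):
        if patch[r][c] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels of a 2D list."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            patch = [row[c - 1:c + 2] for row in image[r - 1:r + 2]]
            hist[lbp_code(patch)] += 1
    return hist


patch = [[5, 9, 1],
         [4, 6, 7],
         [2, 8, 3]]
print(lbp_code(patch))  # 42: neighbours 9, 7 and 8 set bits 1, 3 and 5
```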
Frontal is NOT Enough
Facial Landmark Localization
▪ Model based approach
▪ ASM: T.F. Cootes and C.J. Taylor and D.H. Cooper and J. Graham (1995). "Active shape models - their training and application". Computer Vision and Image Understanding (61): 38–59
▪ AAM: T.F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. ECCV, 2:484–498, 1998
▪ Regression based approach
▪ Cascaded pose regression: P. Dollár, P. Welinder, and P. Perona. "Cascaded pose regression". In CVPR. IEEE, 2010
▪ Explicit shape regression: X. Cao, Y. Wei, F. Wen, and J. Sun. "Face alignment by explicit shape regression". In CVPR. IEEE, 2012
Explicit Shape Regression
Figure: cascade stages t = 0, 1, 2, …, 10 on image 𝐼 (shape initialized from the face detector; affine transform / transform back)
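The cascade update can be sketched as S_t = S_{t-1} + R_t(I, S_{t-1}), starting from a shape placed by the face detector. In the sketch below, `make_toy_regressor` is a hypothetical stand-in: a learned explicit-shape-regression stage predicts the increment from shape-indexed image features (in a normalized frame, then transformed back), whereas this toy reads the true landmarks from a dict so that the loop can run on its own.

```python
def cascaded_shape_regression(image, initial_shape, regressors):
    """Refine a landmark shape with a cascade of stage regressors.

    shape_t = shape_{t-1} + R_t(image, shape_{t-1}); returns every stage's shape.
    """
    shape = [tuple(p) for p in initial_shape]
    history = [shape]
    for regress in regressors:
        delta = regress(image, shape)
        shape = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(shape, delta)]
        history.append(shape)
    return history


def make_toy_regressor(rate):
    """Hypothetical stand-in for a learned stage regressor.

    The toy 'image' stores the true landmarks; each stage moves every point
    a fraction `rate` of the way toward them.
    """
    def regress(image, shape):
        return [((tx - x) * rate, (ty - y) * rate)
                for (x, y), (tx, ty) in zip(shape, image["landmarks"])]
    return regress


toy_image = {"landmarks": [(30.0, 40.0), (70.0, 40.0), (50.0, 75.0)]}
init = [(25.0, 45.0), (65.0, 45.0), (50.0, 65.0)]   # e.g. mean shape in detector box
stages = [make_toy_regressor(0.5) for _ in range(10)]  # t = 1 ... 10
history = cascaded_shape_regression(toy_image, init, stages)
```

Each toy stage halves the remaining error, so after ten stages every point is within about 0.01 pixels of its target; a real cascade gives no such guarantee.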
Unconstrained Face Dataset
Labeled Faces in the Wild
▪ Contains 13,233 images
▪ Consists of 5,749 people
▪ 1,680 people with two or more images
▪ Introduced in 2007 (UMass Amherst Technical Report 07-49)
▪ Photos were collected from the internet
▪ Aligned versions of the faces are also provided, produced by three alignment methods
Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. University of Massachusetts, Amherst, Technical Report 07-49, October, 2007.
LFW: Performance (Image-Restricted)
LFW: Performance (Image-Unrestricted)
YouTube Faces Database
▪ Lior Wolf, Tal Hassner and Itay Maoz, Face Recognition in Unconstrained Videos with Matched Background Similarity. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2011
▪ 3425 videos of 1595 people
YTF: Performance (Image-Restricted)
▪ EER - the error rate at the ROC operating point where the false positive and false negative rates are equal
Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference (BMVC), 2015.
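A sketch of how EER can be estimated from verification scores, assuming similarity scores where a pair is accepted when score ≥ threshold. Sweeping only the observed scores as candidate thresholds is a simplification; the YTF tables report 100% minus EER.

```python
def far_frr(genuine, impostor, threshold):
    """False accept rate and false reject rate at a similarity threshold."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine, impostor):
    """Approximate EER: try every observed score as a threshold, keep the one
    where FAR and FRR are closest, and return their mean at that point."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]


genuine = [0.9, 0.8, 0.7, 0.4]    # same-identity pair scores (toy data)
impostor = [0.1, 0.2, 0.3, 0.6]   # different-identity pair scores (toy data)
eer = equal_error_rate(genuine, impostor)
```

With these toy scores, FAR and FRR are both 0.25 at threshold 0.6, so the estimate is 0.25 (a 100% minus EER of 75%).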
IARPA Janus Benchmark A
▪ Klare et al. Pushing the Frontiers of Unconstrained Face Detection and Recognition: IARPA Janus Benchmark A, CVPR, June 2015
▪ All faces are manually annotated with bounding boxes and fiducial landmarks
▪ Annotated through Amazon Mechanical Turk (AMT)
▪ LFW is not fully unconstrained:
▪ A commodity face detector was used to detect all faces
▪ This limits the range of pose variation, occlusion, and illumination conditions
▪ Three landmarks: the two eyes and the base of the nose
▪ Geographic distribution
▪ Geographic distribution
IJB-A Label Information
▪ 10-fold gallery/probe image sets
▪ 17,000 images for training (333 subjects)
▪ Gallery set: 3,000 images (167 subjects)
▪ Probe set: 13,700 images (includes non-gallery subjects)
▪ x, y coordinates of the eyes and the nose base
▪ Face yaw angle (if applicable)
▪ Observation labels: FOREHEAD_VISIBLE, EYES_VISIBLE, NOSE_MOUTH_VISIBLE, INDOOR, GENDER, SKIN_TONE (6 levels), AGE (5 levels), FACIAL_HAIR
Pose Variation
IJB-A Released Benchmark (1/29/2016)
Unconstrained Face Recognition with Deep Learning
Facebook: DeepFace
▪ DeepFace: Closing the Gap to Human-Level Performance in Face Verification
▪ Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, Lior Wolf; The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1701-1708
▪ Claimed contributions
▪ Facial alignment with 3D modeling
▪ Advances performance on the LFW benchmark
▪ Reaches near-human performance
▪ Advances performance on the YTF benchmark
3D Facial Alignment
▪ Detected face with 6 initial fiducial points
▪ Induced 2D-aligned crop
▪ 67 fiducial points on the 2D-aligned crop, with their Delaunay triangulation
▪ Reference 3D shape transformed to the 2D-aligned crop's image plane
▪ Triangle visibility w.r.t. the fitted 3D-2D camera
▪ Piecewise affine warping guided by the fiducial points
▪ Final frontalized crop
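The warping step maps each triangle of the fiducial-point triangulation with its own affine transform. A minimal sketch of that per-triangle mapping, assuming the vertex correspondences are already known; fitting the 3D model and camera, visibility handling, and pixel resampling are not shown, and the helper names `solve3` and `triangle_affine` are hypothetical.

```python
def solve3(m, v):
    """Solve the 3x3 linear system m @ x = v with Cramer's rule."""
    def det(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
                - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
                + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))
    d = det(m)
    if abs(d) < 1e-12:
        raise ValueError("degenerate triangle")
    result = []
    for col in range(3):
        mc = [row[:] for row in m]   # replace one column with v
        for r in range(3):
            mc[r][col] = v[r]
        result.append(det(mc) / d)
    return result


def triangle_affine(src, dst):
    """Affine map (x, y) -> (a*x + b*y + c, d*x + e*y + f) that sends the
    three source vertices onto the three destination vertices."""
    m = [[x, y, 1.0] for x, y in src]
    a, b, c = solve3(m, [p[0] for p in dst])
    d, e, f = solve3(m, [p[1] for p in dst])
    return lambda x, y: (a * x + b * y + c, d * x + e * y + f)


warp = triangle_affine([(0, 0), (1, 0), (0, 1)], [(2, 3), (4, 3), (2, 5)])
print(warp(0.5, 0.5))  # (3.0, 4.0): scale by 2, then shift by (2, 3)
```

Because an affine map is fixed by three point pairs, each triangle gets an exact fit; the piecewise result is continuous across shared edges.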
DeepFace Architecture
DeepFace: Performance
▪ Results on the Labeled Faces in the Wild (LFW) and YouTube Faces (YTF) databases
Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, Lior Wolf; The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1701-1708
DeepID
▪ Sun, Yi, Xiaogang Wang, and Xiaoou Tang. "Deep learning face representation from predicting 10,000 classes." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014
▪ Features are extracted from 60 face patches
DeepID
DeepID Performance (1)
▪ 160-dimensional DeepID feature
DeepID Performance (2)
Legend: o = outside dataset, u = unrestricted protocol, r = restricted protocol
Google: FaceNet
▪ Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "Facenet: A unified embedding for face recognition and clustering." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015
FaceNet
▪ Objective: learn a Euclidean embedding for each image with a deep neural network (DNN)
▪ Map face images to a compact Euclidean space
▪ Distances in that space directly correspond to face similarity
▪ Approach: DNN trained with a triplet loss
Triplet Loss
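▪ Embedding: $f(x) \in \mathbb{R}^d$
▪ Input images: $x_i^a$ (anchor), $x_i^p$ (positive), and $x_i^n$ (negative)
▪ $\alpha$ is a margin enforced between positive and negative pairs
▪ Corresponding loss function:
$$L = \sum_{i=1}^{N} \left[ \lVert f(x_i^a) - f(x_i^p) \rVert_2^2 - \lVert f(x_i^a) - f(x_i^n) \rVert_2^2 + \alpha \right]_+$$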
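A direct transcription of the loss above, assuming embeddings are plain lists of floats (FaceNet L2-normalizes its embeddings; here that is left to the caller). The default `alpha=0.2` is an illustrative choice.

```python
def squared_distance(u, v):
    """Squared Euclidean distance ||u - v||_2^2."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchors, positives, negatives, alpha=0.2):
    """Sum over triplets of [||f(a) - f(p)||^2 - ||f(a) - f(n)||^2 + alpha]_+ ."""
    total = 0.0
    for a, p, n in zip(anchors, positives, negatives):
        total += max(0.0, squared_distance(a, p) - squared_distance(a, n) + alpha)
    return total


# First triplet already satisfies the margin (contributes 0);
# second has the positive farther than the negative (contributes 1.4).
anchors = [(1.0, 0.0), (1.0, 0.0)]
positives = [(0.6, 0.8), (0.0, 1.0)]
negatives = [(0.0, 1.0), (0.6, 0.8)]
loss = triplet_loss(anchors, positives, negatives)
```

Only triplets that violate the margin contribute, which is why the choice of triplets below matters for convergence.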
Triplet Selection
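▪ For fast convergence, select triplets that violate the triplet constraint
▪ Hard positive: $\arg\max_{x_i^p} \lVert f(x_i^a) - f(x_i^p) \rVert_2^2$
▪ Hard negative: $\arg\min_{x_i^n} \lVert f(x_i^a) - f(x_i^n) \rVert_2^2$
▪ Sampling of the training set:
▪ 40 faces per identity in each mini-batch serve as positive exemplars
▪ Randomly sampled negative faces are added
▪ To avoid converging to bad local minima, use semi-hard negatives:
$$\lVert f(x_i^a) - f(x_i^p) \rVert_2^2 < \lVert f(x_i^a) - f(x_i^n) \rVert_2^2$$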
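A sketch of the semi-hard condition: a negative qualifies when it is farther from the anchor than the positive but still within the margin, d_ap < d_an < d_ap + α. Picking the closest qualifying candidate, and returning `None` when none exists, are assumptions of this sketch; FaceNet selects triplets online within each mini-batch.

```python
def squared_distance(u, v):
    """Squared Euclidean distance ||u - v||_2^2."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def select_semi_hard_negative(anchor, positive, candidates, alpha=0.2):
    """Index of the closest candidate n with d_ap < d_an < d_ap + alpha,
    or None if no candidate is semi-hard."""
    d_ap = squared_distance(anchor, positive)
    semi_hard = []
    for i, n in enumerate(candidates):
        d_an = squared_distance(anchor, n)
        if d_ap < d_an < d_ap + alpha:
            semi_hard.append((d_an, i))
    return min(semi_hard)[1] if semi_hard else None


anchor, positive = (0.0, 0.0), (0.5, 0.0)            # d_ap = 0.25
candidates = [(0.3, 0.0), (0.6, 0.0), (0.65, 0.0), (1.0, 0.0)]
# (0.3, 0) is too hard, (1.0, 0) too easy; (0.6, 0) is the closest semi-hard one.
choice = select_semi_hard_negative(anchor, positive, candidates)
```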
Deep Convolutional Networks
▪ CNN is trained using Stochastic Gradient Descent (SGD) with standard backpropagation
▪ Two types of architectures
▪ Zeiler & Fergus architecture
▪ GoogLeNet-style Inception model
▪ Trained on a CPU cluster for 1,000 to 2,000 hours
▪ 100M-200M training face thumbnails covering about 8M identities
▪ Input sizes range from 96x96 to 224x224 pixels
Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." Computer Vision–ECCV 2014. Springer International Publishing, 2014. 818-833.
Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
Network Details
Performance
▪ Validation rate VAL = true accepts / all same-identity pairs, measured on a hold-out test set of 1M images
▪ VAL for different output (embedding) dimensions
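Following the VAL definition above, a sketch of computing VAL and the matching false accept rate (FAR) at a distance threshold `d`, assuming precomputed embedding distances and a boolean same-identity label for each pair.

```python
def val_far(distances, same_identity, d):
    """VAL(d) = true accepts / all same-identity pairs;
    FAR(d) = false accepts / all different-identity pairs.
    A pair is accepted when its embedding distance is <= d."""
    same = [dist for dist, s in zip(distances, same_identity) if s]
    diff = [dist for dist, s in zip(distances, same_identity) if not s]
    val = sum(dist <= d for dist in same) / len(same)
    far = sum(dist <= d for dist in diff) / len(diff)
    return val, far


# Toy pairs: three same-identity, three different-identity.
distances = [0.3, 0.9, 1.2, 0.5, 1.5, 1.1]
labels = [True, True, True, False, False, False]
val, far = val_far(distances, labels, d=1.0)
```

Raising `d` increases both VAL and FAR, so VAL is normally reported at a fixed FAR.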
Sensitivity to Image Quality
Deep Face Recognition
▪ Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference (BMVC), 2015.
▪ Achieves performance similar to the state of the art on the LFW and YTF datasets
▪ Uses fewer training images and identities
▪ 2.6M images collected from Google Images and Bing using the keyword “actor”
▪ Uses the same triplet-loss strategy as FaceNet
Fine-tuned with VGG Model
▪ The “Very Deep” Architecture
▪ Differs from previously proposed architectures
▪ Network details:
▪ Very small 3 × 3 convolution kernels
▪ Convolution stride of 1 px
▪ ReLU non-linearity
▪ No local contrast normalisation
▪ 3 fully connected layers
Figure: VGG Face network: image → Conv-64, Conv-128, Conv-256, Conv-512, Conv-512 blocks, each followed by maxpool → fc-4096 → fc-4096 → fc-2622 → Softmax
Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
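A sketch that expands the block structure into a layer list, assuming the standard VGG-16 per-block convolution counts (2, 2, 3, 3, 3) from Simonyan and Zisserman and a 224 × 224 input; the diagram does not show the exact counts. It also illustrates a consequence of the stride-1, padded 3 × 3 convolutions: only the five max-pooling layers shrink the feature map.

```python
# Assumed VGG-16 layout: (number of 3x3 conv layers, output channels) per block.
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def vgg_layers(num_classes=2622):
    """Flat list of layer names, ending in the 2622-way classifier."""
    layers = []
    for n_conv, channels in VGG16_BLOCKS:
        layers += [f"conv3-{channels}"] * n_conv
        layers.append("maxpool")
    layers += ["fc-4096", "fc-4096", f"fc-{num_classes}", "softmax"]
    return layers


def spatial_size(input_size=224):
    """3x3 convs with stride 1 and padding 1 keep the size;
    each 2x2, stride-2 maxpool halves it."""
    size = input_size
    for _ in VGG16_BLOCKS:
        size //= 2
    return size


layers = vgg_layers()
```

Under these assumptions the network has 13 convolutional and 3 fully connected layers (16 weight layers), and a 224 × 224 input reaches the first fully connected layer as a 7 × 7 × 512 map.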
Training
▪ MatConvNet Toolbox
▪ NVIDIA cuDNN bindings
▪ Multi-GPU training (approx. 3.5x speedup)
▪ NVIDIA Titan Black
▪ 7 days of training
▪ Stochastic gradient descent with backpropagation
▪ Accumulator descent for large batch sizes
▪ Batch size: 256
▪ Incremental FC layer training
▪ 2622-way multi-class criterion (softmax)
Vedaldi, Andrea, and Karel Lenc. "MatConvNet: Convolutional neural networks for MATLAB." Proceedings of the 23rd Annual ACM Conference on Multimedia Conference. ACM, 2015.
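"Accumulator descent for large batch sizes" most plausibly refers to gradient accumulation, though that reading is an assumption. The sketch uses a toy one-parameter least-squares model (y ≈ w·x) to show why accumulation works: averaging sub-batch gradients, weighted by sub-batch size, reproduces the full-batch gradient, so a batch of 256 can be processed in pieces that fit in GPU memory.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y ≈ w * x over a batch of (x, y)."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, sub_batch_size):
    """Full-batch gradient computed from sub-batches (the last may be smaller).

    Each sub-batch gradient is weighted by its size before averaging, so the
    result equals the gradient of the mean loss over the whole batch.
    """
    total = 0.0
    for start in range(0, len(batch), sub_batch_size):
        chunk = batch[start:start + sub_batch_size]
        total += grad_mse(w, chunk) * len(chunk)
    return total / len(batch)


batch = [(i / 10, 2 * i / 10 + 1) for i in range(25)]   # toy data
full = grad_mse(0.5, batch)
accumulated = accumulated_grad(0.5, batch, sub_batch_size=8)
```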
Performance on LFW
No.  Method                 # Training Images   # Networks   Accuracy (%)
1    Fisher Vector Faces    -                   -            93.10
2    DeepFace               4 M                 3            97.35
3    DeepFace Fusion        500 M               5            98.37
4    DeepID-2,3             Full                200          99.47
5    FaceNet                200 M               1            98.87
6    FaceNet + Alignment    200 M               1            99.63
7    VGG Face               2.6 M               1            98.95
Performance on YTF
No.  Method                      # Training Images   # Networks   100%-EER   Accuracy (%)
1    Video Fisher Vector Faces   -                   -            87.7       93.10
2    DeepFace                    4 M                 1            91.4       91.4
4    DeepID-2,2+,3                                   200          -          93.2
5    FaceNet + Alignment         200 M               1            -          95.1
7    VGG Face                    2.6 M               1            97.4       97.3
Lightened CNN
▪ Wu, Xiang, Ran He, and Zhenan Sun. "A Lightened CNN for Deep Face Representation." arXiv preprint arXiv:1511.02683 (2015).
▪ Achieves performance competitive with previous models
▪ Proposes two network architectures
▪ Introduces a new activation function, Max-Feature-Map (MFM), to replace ReLU
Max-Feature-Map
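MFM splits the 2N input channels into two halves and keeps the element-wise maximum, $f^k = \max(x^k, x^{k+N})$ for $k = 1, \ldots, N$, so it halves the channel count and, unlike ReLU, can output negative values. A minimal sketch at one spatial position, with the channels as a plain list; in a network the same operation is applied at every position of the feature map.

```python
def max_feature_map(channels):
    """Max-Feature-Map over 2N channel values at one spatial position:
    element-wise max of the first and second halves, giving N outputs."""
    if len(channels) % 2:
        raise ValueError("MFM needs an even number of channels")
    n = len(channels) // 2
    return [max(a, b) for a, b in zip(channels[:n], channels[n:])]


print(max_feature_map([1.0, -2.0, 3.0, 0.5, 4.0, -1.0]))  # [1.0, 4.0, 3.0]
print(max_feature_map([-3.0, -1.0]))                       # [-1.0], negative kept
```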
Performance
▪ On LFW:
▪ On YTF:
Deep Learning Applications Other than Recognition
Incorrect Alignment
Liu, Ziwei, et al. "Deep learning face attributes in the wild." Proceedings of the IEEE International Conference on Computer Vision. 2015.
Deep Learning Face Attributes
Details of the Networks
▪ LNet directly applies the AlexNet architecture
▪ Pre-trained on the 1,000 ImageNet object categories
▪ LNet is then fine-tuned using attribute tags
Face Localization Performance (LNet)
Face Localization Performance (LNet)
Face Attributes Visualization
Attribute Accuracy
Discussion and Conclusion
LFW Survey
▪ Labeled Faces in the Wild: A Survey: Erik Learned-Miller, Gary Huang, Aruni RoyChowdhury, Haoxiang Li, Gang Hua
▪ The future of face recognition
▪ Verification versus identification
▪ Two randomly chosen individuals often differ substantially in appearance, so verifying random pairs is relatively easy
▪ The more people in a gallery, the greater the chance that two individuals have similar appearance
▪ New face datasets
▪ IJB-A
▪ CASIA
▪ FaceScrub
▪ MegaFace
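The gallery-size effect can be made concrete with a simple model, assuming each non-matching gallery identity independently passes the match threshold with the same small probability p: the chance of at least one false match among N such identities is 1 − (1 − p)^N. Real impostor scores are not independent, so this is only an illustration.

```python
def prob_false_match(per_pair_far, gallery_size):
    """P(at least one of `gallery_size` non-matching identities passes the
    threshold), assuming independent comparisons with equal pass rate."""
    return 1.0 - (1.0 - per_pair_far) ** gallery_size


small = prob_false_match(0.001, 1)
large = prob_false_match(0.001, 1000)
```

With p = 0.001 this is 0.001 for a single comparison but about 0.63 for a gallery of 1,000 non-matching identities, which is why identification in large galleries is harder than one-to-one verification at the same threshold.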
Discussion
▪ Unconstrained face recognition is a competitive field
▪ Target dataset: IJB-A
▪ Testing different approaches using available source code and trained models
▪ Evaluating the effectiveness of the Lightened CNN
▪ Facial attributes may serve as auxiliary information
Large-scale CelebFaces Attributes (CelebA) Dataset
▪ S. Yang, P. Luo, C. C. Loy, and X. Tang, "From Facial Parts Responses to Face Detection: A Deep Learning Approach", in IEEE International Conference on Computer Vision (ICCV), 2015
▪ 10,177 identities
▪ 202,599 face images
▪ 5 landmark locations and 40 binary attribute annotations per image
▪ Available for download
▪ 1.34 GB for 202,599 aligned and cropped face images
▪ Aligned by a similarity transformation based on the two eye locations, then resized to 218 × 178
▪ 9.8 GB for 202,599 original web face images
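The eye-based alignment can be sketched by treating points as complex numbers: a similarity transform (rotation, uniform scale, translation) is then z ↦ a·z + b, and two eye correspondences determine a and b exactly. The target eye positions within the 218 × 178 crop are not given here, so the values in the example are hypothetical.

```python
def similarity_from_eyes(src_left, src_right, dst_left, dst_right):
    """Similarity transform mapping two source eye centres exactly onto two
    target positions. With z = x + iy, the map is z -> a*z + b, where the
    complex a encodes rotation and scale and b encodes translation."""
    s1, s2 = complex(*src_left), complex(*src_right)
    d1, d2 = complex(*dst_left), complex(*dst_right)
    a = (d2 - d1) / (s2 - s1)
    b = d1 - a * s1

    def apply(x, y):
        z = a * complex(x, y) + b
        return z.real, z.imag

    return apply


# Hypothetical eye positions: detected (30, 40), (70, 40); targets (60, 80), (120, 80).
align = similarity_from_eyes((30, 40), (70, 40), (60, 80), (120, 80))
print(align(50, 40))  # midpoint between the eyes maps to (90.0, 80.0)
```

Because only two points are used, the transform cannot correct non-rigid differences; it only normalizes eye position, in-plane rotation, and scale.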
Deep Face Dreams
Figure panels: Representative Image, Neuron Inversion
Mahendran, Aravindh, and Andrea Vedaldi. "Understanding deep image representations by inverting them." Computer Vision and Pattern Recognition (CVPR), 2015
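Mahendran and Vedaldi reconstruct an image from a representation by minimizing ‖Φ(x) − Φ₀‖² plus a regularizer. A toy sketch of that optimization by gradient descent, assuming a linear map Φ(x) = Wx in place of a deep network and a simple λ‖x‖² penalty in place of the paper's image priors; the function name and step sizes are hypothetical.

```python
def invert_linear_representation(W, target, lam=1e-3, lr=0.1, steps=2000):
    """Gradient descent on ||W x - target||^2 + lam * ||x||^2, starting at x = 0."""
    rows, cols = len(W), len(W[0])
    x = [0.0] * cols
    for _ in range(steps):
        # Residual of the representation: W x - target.
        residual = [sum(W[i][j] * x[j] for j in range(cols)) - target[i]
                    for i in range(rows)]
        # Gradient: 2 W^T (W x - target) + 2 lam x.
        grad = [2 * sum(W[i][j] * residual[i] for i in range(rows)) + 2 * lam * x[j]
                for j in range(cols)]
        x = [xj - lr * gj for xj, gj in zip(x, grad)]
    return x


W = [[2.0, 0.0], [1.0, 1.0]]
x_true = [1.0, -1.0]
target = [2.0, 0.0]          # W @ x_true
x_rec = invert_linear_representation(W, target)
```

The small penalty pulls the reconstruction slightly toward zero (about (0.9995, −0.9985) here); with a deep, non-injective Φ, the regularizer is what selects a natural-looking image among many that share the same representation.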
Questions?