Deep Models for Face Processing - ShanghaiTech SSDS 2015 (transcript of tutorial slides, shanghaitechssds2015.shanghaitech.edu.cn/slides/tutorial...)
Deep Models for Face Processing
with “Big” or “Small” Data
Shiguang Shan
Institute of Computing Technology, Chinese
Academy of Sciences
Outline
- Background: a brief history of face recognition in terms of benchmark evolution
- Feature learning with CNN (trained with "big" data)
  - For the ACM ICMI EmotiW 2014 challenge
  - For the IEEE FG2015 PaSC video-based FR challenge
- Auto-encoder for Face X (with "small" data)
  - DAE for face alignment (ECCV14)
  - DAE for face normalization (CVPR14)
  - DAE for face super-resolution (ECCV14)
  - DAE for face detection (under review)
  - DAE for cross-domain face recognition (under review)
- Summary and discussion
A Brief History of FR
Academic milestones: ORL, Extended Yale B, AR: 1990~ (<130 subjects)
- Recognition rates: 95%~99% [J. Wright et al., 2008]
- Typical methods: linear models (PCA, LDA, SRC)
- ORL (40 subjects, 10 ipp); AR (126 subjects, 26 ipp)
A Brief History of FR
Academic milestones:
- ORL, Extended Yale B, AR: 1990~ (<130 subjects)
- FERET: 1994~2010 (1196 subjects, 2~5 ipp); recognition rates: 99%~94% (Dup.I & II) [S. Xie, S. Shan, X. Chen, IEEE T IP 2010]
A Brief History of FR
Academic milestones:
- ORL, Extended Yale B, AR: 1990~ (<130 subjects)
- FERET: 1994~2010 (1196 subjects, 2~5 ipp); recognition rates: 99%~94% (Dup.I & II) [S. Xie, S. Shan, X. Chen, IEEE T IP 2010]
- Methods: local Gabor magnitude + local Gabor phase + block-based LDA
[Figure: Gabor + BFLD pipeline; per-block scores S1, S2, ..., SM are summed into a final score S.]
A Brief History of FR—FERET
1. Claudio A. Perez, Leonardo A. Cament, Luis E. Castillo. Methodological improvement on local Gabor face recognition based on feature selection and enhanced Borda count. Pattern Recognition 44 (2011), 951-963.
2. Georgios Tzimiropoulos, Stefanos Zafeiriou, Maja Pantic. Subspace Learning from Image Gradient Orientations. IEEE T PAMI, 2012.
3. Hieu V. Nguyen, Li Bai, Linlin Shen. Local Gabor Binary Pattern Whitened PCA: A Novel Approach for Face Recognition from Single Image Per Person. ICB 2009, LNCS 5558, pp. 269-278, 2009.
4. Andrew Wagner, John Wright, Arvind Ganesh, Zihan Zhou, Hossein Mobahi, Yi Ma. Toward a Practical Face Recognition System: Robust Alignment and Illumination by Sparse Representation. IEEE T PAMI, 2012.
5. Ngoc-Son Vu, Alice Caplier. Face Recognition with Patterns of Oriented Edge Magnitudes. ECCV 2010.
6. Timo Ahonen, Abdenour Hadid, Matti Pietikäinen. Face Recognition with Local Binary Patterns. ECCV 2004.

Comparative methods on the FERET probe sets (released by NIST):

Method                                       FB     FC     Dup.I  Dup.II
Our method [T IP10]                          99%    100%   94%    93%
[1] LGP + Borda Count (PR11)                 99.8%  99.5%  89.2%  86.8%
[2] Image Gradient Orientations (T PAMI12)   -      -      88.9%  85.4%
[3] LGBP + Whitened PCA (ICB09)              98.1%  98.9%  83.8%  81.6%
[5] Oriented Edge Magnitudes (ECCV10)        98.1%  99%    79%    79.1%
[4] Improved SRC (T PAMI12)                  96.6%  58.8%  71.6%  61.5%
[6] LBP (ECCV04)                             97%    79%    66%    64%
A Brief History of FR—FRGC
Academic milestones:
- ORL, Extended Yale B, AR: 1990~ (<130 subjects)
- FERET: 1994~2010 (1196 subjects, 2~5 ipp)
- FRGC v2.0: 2004~2012 (~500 subjects, ~50 ipp); VR=96% @ FAR=0.1% [Y. Li, S. Shan, H. Zhang, S. Lao, X. Chen, ACCV12]
A Brief History of FR—FRGC
Academic milestones:
- ORL, Extended Yale B, AR: 1990~ (<130 subjects)
- FERET: 1994~2010 (1196 subjects, 2~5 ipp)
- FRGC v2.0: 2004~2012 (~500 subjects, ~50 ipp); VR=96% @ FAR=0.1% [Y. Li, S. Shan, H. Zhang, S. Lao, X. Chen, ACCV12]
- Method: local Gabor magnitude + LPQ + block-LDA
A Brief History of FR—FRGC
Verification rates on the FRGC test set (at FAR=0.1%):

Method                                Exp.1  Exp.4
FRGC baseline (Eigenfaces)            66%    12%
Hybrid Fourier [Hwang 2006]           91%    74%
KFA [Liu 2006]                        92%    76%
DCT_EFM [Liu 2008]                    n/a    84%
Gabor + LDA [Han 2010]                97%    78%
LBP & Gabor + KLDA + SN [Tan 2010]    n/a    88%
Our method [Su 2009]                  98%    89%
RTF + RCF [Deng 2013]                 99%    93.5%
Our method [Li 2012]                  99%    96%

[Hwang 06] W. Hwang, et al. Multiple Face Model of Hybrid Fourier Feature for Large Face Image Set. CVPR 2006.
[Liu 06] C. Liu. Capitalize on dimensionality increasing techniques for improving face recognition performance. IEEE T PAMI, 2006.
[Liu 08] Z. Liu, C. Liu. Fusion of the complementary Discrete Cosine Features in the YIQ color space for face recognition. CVIU, 2008.
[Han 10] Z. Han, C. Fang, X. Ding. A Discriminated Correlation Classifier for Face Recognition. Proc. of 2010 ACM Symposium on Applied Computing, 2010.
[Tan 10] X. Tan, B. Triggs. Enhanced Local Texture Feature Sets for Face Recognition Under Difficult Lighting Conditions. IEEE T IP 19(6), 2010.
[Deng 13] W. Deng, J. Hu, J. Guo, W. Cai, D. Feng. Emulating biological strategies for uncontrolled face recognition. Pattern Recognition, 2013.
[Li 12] Y. Li, S. Shan, H. Zhang, S. Lao, X. Chen. Fusing Magnitude and Phase Features for Robust Face Recognition. ACCV 2012.
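The verification-rate-at-fixed-FAR metric used in this table (and in the following slides) can be computed from genuine and impostor similarity scores. A minimal sketch, assuming higher scores mean more similar; thresholding conventions vary slightly between benchmarks, so treat this as illustrative:

```python
import numpy as np

def vr_at_far(genuine, impostor, far=0.001):
    """Verification rate at a fixed false accept rate.

    The threshold is chosen so that the requested fraction of
    impostor scores is (wrongly) accepted; VR is then the fraction
    of genuine scores above that threshold."""
    imp = np.sort(np.asarray(impostor, dtype=float))[::-1]  # descending
    k = max(int(far * len(imp)), 1)   # number of accepted impostors
    threshold = imp[k - 1]
    return float(np.mean(np.asarray(genuine, dtype=float) > threshold))
```

For example, at FAR=0.1% with 100,000 impostor pairs, the threshold is placed at the 100th highest impostor score.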
A Brief History of FR—MBE2010
Academic milestones:
- ORL, Extended Yale B, AR: 1990~ (<130 subjects)
- FERET: 1994~2010 (1196 subjects, 2~5 ipp)
- FRGC v2.0: 2004~2012 (~500 subjects, ~50 ipp)
- NIST MBE 2010 (1.6M subjects, ~2 ipp); scenario: ID photo vs. ID photo; face identification 1:N (closed set) with N = 1.6 million
A Brief History of FR—China Test
Academic milestones:
- ORL, Extended Yale B, AR: 1990~ (<130 subjects)
- FERET: 1994~2010 (1196 subjects, 2~5 ipp)
- FRGC v2.0: 2004~2012 (~500 subjects, ~50 ipp)
- NIST MBE 2010 (1.6M subjects, ~2 ipp)
- Test on China 2nd-generation ID card photos, 2010:
  - 1:N (10M faces): ~90% (Sagem solutions)
  - 1:N (2.7M faces): ~92% (ours, 100K probes)
A Brief History of FR—LFW
Academic milestones:
- ORL, Extended Yale B, AR: 1990~ (<130 subjects)
- FERET: 1994~2010 (1196 subjects, 2~5 ipp)
- FRGC v2.0: 2004~2012 (~500 subjects, ~50 ipp)
- NIST MBE 2010 (1.6M subjects, ~2 ipp)
- Test on China 2nd-generation ID card photos, 2010
- LFW: since 2007 (~5749 celebrities, 1680 with >2 ipp)
  - 95.17% [D. Chen, X. Cao, F. Wen, J. Sun, CVPR13]
  - 97.35% [Y. Taigman, M. Yang, M. Ranzato, L. Wolf, CVPR14]
  - 97.45% [Y. Sun, X. Wang, X. Tang, CVPR14]
  - >99.5% [DeepID3, Face++, Tencent, insky.so, ...]
  - 99.63% [FaceNet; F. Schroff, D. Kalenichenko, J. Philbin, CVPR15]
A Brief History of FR—LFW
Academic milestones:
- ORL, Extended Yale B, AR: 1990~ (<130 subjects)
- FERET: 1994~2010 (1196 subjects, 2~5 ipp)
- FRGC v2.0: 2004~2012 (~500 subjects, ~50 ipp)
- NIST MBE 2010 (1.6M subjects, ~2 ipp)
- Test on China 2nd-generation ID card photos, 2010
- LFW: since 2007 (~5749 celebrities, 1680 with >2 ipp)
  - 95.17% [D. Chen, X. Cao, F. Wen, J. Sun, CVPR13] (method: high-dimensional LBP + Joint Bayesian)
  - 97.35% [Y. Taigman, M. Yang, M. Ranzato, L. Wolf, CVPR14]
  - 97.45% [Y. Sun, X. Wang, X. Tang, CVPR14]
  - >99.5% [DeepID3, Face++, Tencent, insky.so, ...]
  - 99.63% [FaceNet; F. Schroff, D. Kalenichenko, J. Philbin, CVPR15]
  - Method behind the latter results: deep learning
More About LFW Evaluation
Labeled Faces in the Wild (LFW):
- Face verification (1:1) on celebrity faces; photos from Yahoo! News
- Evaluation protocol: unrestricted training set; test set of 6000 image pairs, half of the same person and half from different persons
Huang G B, Ramesh M, Berg T, et al. Labeled faces in the wild: A database for
studying face recognition in unconstrained environments. Technical Report,
University of Massachusetts, Amherst, 2007.
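Under this protocol, reported accuracy is the fraction of the 6000 pairs on which a thresholded similarity score agrees with the same/different label. A sketch of the computation with hypothetical scores (the 10-fold threshold selection used in practice is simplified here to a single split):

```python
import numpy as np

def verification_accuracy(scores, labels, threshold):
    """A pair is declared 'same person' when its similarity score
    exceeds the threshold; accuracy is the fraction classified right."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    return float(np.mean((scores > threshold) == labels))

def best_threshold(scores, labels):
    """Pick the candidate threshold maximizing accuracy, as one
    would on held-out folds."""
    candidates = np.unique(scores)
    accs = [verification_accuracy(scores, labels, t) for t in candidates]
    return float(candidates[int(np.argmax(accs))])
```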
More About LFW Evaluation
- 2014: DeepFace [1] (Facebook); training: 4K subjects, 4.4M images
- 2015: DeepID2+ [2]; training: 10K celebrities, 202K images
[1] Taigman Y, Yang M, Ranzato M A, et al. Deepface: Closing the gap to human-
level performance in face verification. CVPR, 2014.
[2] Sun Y, Wang X, Tang X. Deeply learned face representations are sparse,
selective, and robust. arXiv preprint, 2014.
Is Face Recognition Solved?
No! There are many distinct scenarios: some are almost solved, some far from solved.
Scenarios almost solved:
- Face verification in controlled environments with cooperative users: time attendance, access control (low security requirements), student verification in exams
- Duplicate identity checking based on face photos: MPS duplicate passport checking (0.2 billion faces)
- VIP watch-list screening: banks, shops, stores...
- Celebrity face retrieval
Is Face Recognition Solved?
No! There are many distinct scenarios: some are almost solved, some far from solved.
Scenarios almost solved:
- Face verification in controlled environments with cooperative users: time attendance, access control (low security requirements), student verification in exams. Unsolved: twins, major plastic surgery
- Duplicate identity checking based on face photos: MPS duplicate passport checking (0.2 billion faces). Unsolved: naturally similar faces
- Whitelist (e.g. VIP) screening: banks, shops, stores... (half a loaf is better than no bread)
- Celebrity face retrieval: recall rate is not seriously considered...
Is Face Recognition Solved?
No! There are many distinct scenarios: some are almost solved, some far from solved.
Scenarios not solved:
- Face verification with very high security demands: payment by face, access control...
- Face verification against ID photos, e.g. based on China 2nd-generation ID card photos
- Blacklist screening for video surveillance
Is Face Recognition Solved?
No! There are many distinct scenarios: some are almost solved, some far from solved.
Scenarios not solved:
- Face verification with very high security demands (payment by face, access control...): not convenient enough, as the false reject rate is >30% @ FAR=0.01%; anti-spoofing is hard (photo, video, synthesized video...)
- Face verification against ID photos (e.g. China 2nd-generation ID card photos): false reject rate >30% @ FAR=0.1% with large photos, >50% @ FAR=0.1% with on-chip photos
- Blacklist screening for video surveillance: recognition rate <30% @ FAR=0.01%; surveillance videos are low quality, and large-scale training/testing data for this kind of scenario is lacking
Advertisement: A New Database
COX video face database: http://vipl.ict.ac.cn/resources/datasets/cox-face-dataset
Features of COX:
- 1000 subjects, each with 1 high-quality still image and 3 low-quality video clips from 3 camcorders
- (Aims to) simulate video surveillance
- Evaluation protocols included
Advertisement: A New Database
COX: still image vs. video clips; verification rate <40% @ FAR=0.1%
Outline
- Background: a brief history of face recognition in terms of benchmark evolution
- Feature learning with CNN (trained with "big" data)
  - For the ACM ICMI EmotiW 2014 challenge
  - For the IEEE FG2015 PaSC video-based FR challenge
- Auto-encoder for Face X (with "small" data)
  - DAE for face alignment (ECCV14)
  - DAE for face normalization (CVPR14)
  - DAE for face super-resolution (ECCV14)
  - DAE for face detection (under review)
  - DAE for cross-domain face recognition (under review)
- Summary and discussion
M. Liu, R. Wang, S. Li, Z.Huang, S.Shan, X. Chen. Combining Multiple Kernel Methods
on Riemannian Manifold for Emotion Recognition in the Wild. ACM ICMI 2014
Feature Learning with CNN
CNN can learn features with surprising results, provided you have big data!
Feature engineering vs. feature learning:
- Manually designed filters vs. learned filters
- The former yield low-level features (e.g. gradients...); the latter yield mid- and high-level features that are increasingly explanatory and abstract, closer to semantics
- Features are learned with a task-specific objective, yet are sharable across tasks and easy to transfer
Two examples from our practice follow.
EmotiW 2014: Task
Task: classify an audio-video clip into one of seven categories: neutral, anger, disgust, fear, happy, sad, surprise.
Challenge: close-to-real-world conditions with large variations, e.g. head pose, illumination, partial occlusion, etc.
EmotiW 2014: Data
Challenging data: the AFEW* 4.0 database, audio-video clips collected from movies showing close-to-real-world conditions.

Attribute of AFEW 4.0    Description
Length of sequences      300-5400 ms
Number of annotators     3
Emotion categories       Anger, disgust, fear, happiness, neutral, sadness, and surprise
Audio/video format       Audio: WAV; video: AVI
# of samples             1368
# of subjects            428
# of movies              111

*Acted Facial Expressions in the Wild
EmotiW 2014: Protocols
Evaluation protocols: the dataset is divided into training, validation, and testing sets; the test labels were unknown; either the audio or video modality, or both, can be used.

Set    # subjects  Min. age  Max. age  Avg. age  # males  # females
Train  177         5         76        34        102      75
Val    136         10        70        35        78       58
Test   115         5         88        34        64       51

Set    Anger  Disgust  Fear  Happiness  Neutral  Sadness  Surprise
Train  92     66       66    105        102      82       54
Val    59     39       44    63         61       59       46
Test   58     26       46    81         117      53       26
Our Method
Stage 1: emotion video representation. Image features are extracted on aligned faces (dense SIFT, HOG, DCNN), and each video (image set) is modeled as a linear subspace, a covariance matrix, or a Gaussian distribution.
Stage 2: emotion video recognition. Classification on the Riemannian manifold via kernel SVM/LR/PLS, followed by score-level fusion.
M. Liu, R. Wang, S. Li, Z.Huang, S.Shan, X. Chen. Combining Multiple Kernel Methods
on Riemannian Manifold for Emotion Recognition in the Wild. ACM ICMI 2014
Our Method
Image features: aligned face images of 64x64; features: HOG, dense SIFT, DCNN.
- DCNN: CaffeNet trained on the CFW database, over 150,000 face images from 1520 subjects; identities serve as the supervision labels. Architecture: 3@237x237 > 96@57x57 > 96@28x28 > 256@28x28 > 384@14x14 > 256@14x14 > 256@7x7 > 4096 > 1520. The output of the last convolutional layer is used as the final image feature: 256x7x7 = 12,544 dims.
- HOG: block size 16x16; stride 8; # of blocks: 7x7 = 49; # of cells per block: 2x2; # of bins: 9; total dims: 2x2x9x49 = 1764.
- Dense SIFT: block size 16x16; stride 8; # of points: 7x7 = 49; dims per point: 4x4x8 = 128; total dims: 128x49 = 6272.
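The dimensionalities above follow directly from the block/cell layout; a quick sanity-check sketch (not the actual extraction code):

```python
def hog_dims(cells_per_block=(2, 2), bins=9, blocks=7 * 7):
    # 2x2 cells x 9 bins x 49 blocks = 1764
    return cells_per_block[0] * cells_per_block[1] * bins * blocks

def dense_sift_dims(dims_per_point=4 * 4 * 8, points=7 * 7):
    # 128 dims per keypoint x 49 points = 6272
    return dims_per_point * points

def dcnn_dims(channels=256, height=7, width=7):
    # flattened output of the last convolutional layer = 12544
    return channels * height * width
```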
Our Results
Combining multiple features (accuracy, %):

Method                                               Validation set  Test set
Baseline (provided by EmotiW organizers)             34.40           33.70
Audio (openSMILE toolkit)                            30.73           --
Video: HOG                                           38.01           --
Video: Dense SIFT                                    43.94           --
Video: DCNN (Caffe-CFW)                              43.40           --
Video: HOG + Dense SIFT                              44.47           --
Video: HOG + Dense SIFT + DCNN (Caffe-CFW)           45.28           --
Audio + Video (HOG + Dense SIFT)                     46.36           46.68
Audio + Video (HOG + Dense SIFT + DCNN (Caffe-CFW))  48.52           50.37
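The "Audio + Video" rows come from score-level fusion. The slides do not give the fusion weights, so this is a generic weighted-average sketch with hypothetical inputs:

```python
import numpy as np

def fuse_scores(score_matrices, weights=None):
    """Score-level fusion: per-class score matrices from several
    pipelines (e.g. audio, HOG, SIFT, DCNN) are averaged, optionally
    with weights, before taking the argmax over classes."""
    stacked = np.stack([np.asarray(s, dtype=float) for s in score_matrices])
    if weights is None:
        weights = np.full(len(score_matrices), 1.0 / len(score_matrices))
    fused = np.tensordot(np.asarray(weights, dtype=float), stacked, axes=1)
    return fused.argmax(axis=1)   # predicted emotion class per clip
```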
Final Results of Competition
Outline
- Background: a brief history of face recognition in terms of benchmark evolution
- Feature learning with CNN (trained with "big" data)
  - For the ACM ICMI EmotiW 2014 challenge
  - For the IEEE FG2015 PaSC video-based FR challenge
- Auto-encoder for Face X (with "small" data)
  - DAE for face alignment (ECCV14)
  - DAE for face normalization (CVPR14)
  - DAE for face super-resolution (ECCV14)
  - DAE for face detection (under review)
  - DAE for cross-domain face recognition (under review)
- Summary and discussion
FG 2015 Video FR Challenge
Task: video-to-video face verification.
- Exp. 1 (controlled case): video-to-video verification; 1920x1080 video captured by a mounted camera
- Exp. 2 (handheld case): video-to-video verification; resolution varying from 640x480 to 1280x720; videos from a mix of different handheld point-and-shoot video cameras
FG 2015 Video FR Challenge
Videos for testing in the PaSC datasets
[Beveridge, BTAS’13]
Results in IJCB 2014
Verification rates at FAR=1% for the video-to-video (Exp. 1) and video-to-still (Exp. 2) tasks [Beveridge, IJCB'14] (handheld experiment).
Best method: Haoxiang Li, Gang Hua. Eigen Probabilistic Elastic Part (Eigen-PEP) model, CVPR13/ICCV13.
Our Method
- DCNN (single-frame feature)
- HERML (video representation and classification): Hybrid Euclidean-and-Riemannian Metric Learning [Huang, Wang, Shan, Chen, ACCV'14]
Each video is summarized by multiple statistics (mean in R^d, covariance in Sym+_d, Gaussian in Sym+_{d+1}), which live in heterogeneous spaces; a KLDA is learned for each, and the results are fused at the score level.
[Figure: frame-level DCNN [Jia'13]: four conv + pool blocks (layers 1-1 to 4-3), two further conv layers (5-1, 5-2), two fully connected layers (6-1, 6-2), and a softmax output.]
Training Models
Training the DCNN (Caffe [Jia'13]; 14 conv. layers, from 5 blocks):
- Pre-train: CFW, 153,461 images from 1520 persons; starting learning rate 0.01
- Fine-tune: PaSC training set + COX; starting learning rate 0.001
  - PaSC training set: 170 persons, 38,113 images
  - COX training set (our own, surveillance-like videos): 1000 persons, 147,737 video frames
- Features finally exploited: the 2048-dimensional features of the fc6-2 layer for each frame
Training Models
Training HERML: 1,165 videos from 470 persons, drawn from two heterogeneous datasets:
- PaSC training set: 170 persons, 265 videos
- COX training set: 300 persons, 900 videos (3 videos/person)
Final feature dimension (per video): 1320 (440x3) KLDA features.
Evaluation Results
The deeper the better.
[Figure: four DCNN architectures of increasing depth with their verification rates for single-frame features: control 41.40% / handheld 41.62%; control 46.61% / handheld 46.23%; control 47.41% / handheld 48.02%; control 54.76% / handheld 56.20%. With HERML set models on top of the DCNN: control 56.20% / handheld 54.41% and control 58.63% / handheld 59.14%.]
Primary Results
Image features: HOG < Dense SIFT << DCNN (verification rates, %):

Method  HOG (Control/Handheld)  Dense SIFT (Control/Handheld)  DCNN (Control/Handheld)
HERML   25.26 / 19.28           33.82 / 28.93                  58.63 / 59.14
Outline
- Background: a brief history of face recognition in terms of benchmark evolution
- Feature learning with CNN (trained with "big" data)
  - For the ACM ICMI EmotiW 2014 challenge
  - For the IEEE FG2015 PaSC video-based FR challenge
- Auto-encoder for Face X (with "small" data)
  - DAE for face alignment (ECCV14)
  - DAE for face normalization (CVPR14)
  - DAE for face super-resolution (ECCV14)
  - DAE for face detection (under review)
  - DAE for cross-domain face recognition (under review)
- Summary and discussion
Auto-encoder for Face X
CNN is good, but:
- It needs big data to train
- It is slow, not only in training but also in testing
- It is mainly good for feature learning
The auto-encoder, in contrast, is:
- Simple
- Fast in both training and testing
- A general non-linear transform
- Not as good for feature learning
Some example practices follow.
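To make the contrast concrete: a single auto-encoder layer is just two small matrix maps trained by gradient descent. A minimal numpy sketch, illustrative only; the models in the following slides stack several such layers and add regularization and task-specific targets:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_ae(X, hidden, lr=0.1, epochs=500):
    """One auto-encoder layer: encode x -> h = sigmoid(W1 x + b1),
    decode h -> x_hat = W2 h + b2, minimizing ||x - x_hat||^2."""
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, d)); b2 = np.zeros(d)
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)              # encoder
        err = (H @ W2 + b2) - X               # decoder residual
        dW2 = H.T @ err / n; db2 = err.mean(axis=0)
        dH = err @ W2.T * H * (1.0 - H)       # backprop through sigmoid
        dW1 = X.T @ dH / n; db1 = dH.mean(axis=0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return W1, b1, W2, b2
```

Both training and inference are a handful of matrix products, which is what makes the auto-encoder fast relative to a deep CNN.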
Problem to Solve
Face alignment: predict facial landmarks from a detected face.
Given the detected face region $I(u,v)$ and the facial landmarks $S = (x_1, y_1, x_2, y_2, \dots, x_L, y_L)$, the goal is to learn
$$S = H(I),\quad I \in \mathbb{R}^{w \times h},\quad S \in \mathbb{R}^{2L}.$$
J. Zhang, S. Shan, M. Kan, X. Chen. Coarse-to-Fine Auto-Encoder Networks
(CFAN) for Real-Time Face Alignment. ECCV2014 (oral)
Challenges
H is a complex nonlinear mapping: large appearance and shape variations due to head pose, expressions, illumination, and partial occlusion.
Related Works
- ASM & AAM [Cootes'95; Gu'08; Cootes'01; Matthews'04]: sensitive to initial shapes and to noise; hard to cover complex variations
- DCNN [Sun'13; Toshev'14]
- Shape regression models:
  - Linear regression [X. Chai, S. Shan, W. Gao, ICASSP'03]: $S = WI$
  - CPR, ESR, RCPR [Dollar'10; Cao'12; Burgos-Artizzu'13]
  - DRMF [Asthana'13]
  - SDM [Xiong'13]
Motivation
Directly apply a Stacked Auto-Encoder (SAE)? OK, but not good. Why? It easily overfits to small data: typically only thousands of images with landmark annotations are available.
Our ideas: exploit priors.
- Handcrafted features: avoid convolution (slow, needs big data...); use SIFT, shape-indexed features
- Better initialization
- Coarse to fine: piecewise non-linear
Our Method: Schema of Coarse-to-Fine Auto-Encoder Networks
[Figure: a global SAN applies a nonlinear mapping H0 to the image I, producing an initial shape S0; local SANs then refine it with nonlinear mappings over shape-indexed features: S1 = H1(phi(S0)), S2 = H2(phi(S1)), S3 = H3(phi(S2)).]
SAN: Stacked Auto-encoder Network
Our Method
[Figure: the cascade in detail. The global SAN predicts S0 from I; each local SAN #j computes shape-indexed features phi(S_{j-1}), predicts a deviation Delta S_j, and updates S_j = S_{j-1} + Delta S_j, for j = 1, 2, 3.]
Our Method: Global SAN
The global SAN learns the mapping $H_0: S \leftarrow I$ from image $I$ to shape $S$, modeled as a stacked auto-encoder (regression with regularization):
$$H_0^* = \arg\min_{H_0} \left\| S - f_k(f_{k-1}(\dots f_1(I))) \right\|_2^2 + \alpha \sum_{i=1}^{k} \| W_i \|_F^2$$
with sigmoid hidden layers and a linear output layer:
$$f_i(a_{i-1}) = \sigma(W_i a_{i-1} + b_i) \triangleq a_i,\quad i = 1, \dots, k-1$$
$$f_k(a_{k-1}) = W_k a_{k-1} + b_k \triangleq S_0$$
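In code, inference with a trained global SAN is a plain feed-forward pass: sigmoid hidden layers followed by a linear output producing the coarse shape S0. A sketch with stand-in weights (training, i.e. the minimization above, is omitted):

```python
import numpy as np

def global_san_predict(I, layers):
    """Forward pass of the global SAN.  `layers` is a list of (W, b)
    pairs; all but the last use a sigmoid (the f_i), the last is the
    linear regression layer (f_k) outputting the 2L coordinates."""
    a = np.asarray(I, dtype=float).ravel()        # vectorized face image
    for W, b in layers[:-1]:
        a = 1.0 / (1.0 + np.exp(-(W @ a + b)))    # a_i = sigma(W_i a + b_i)
    W, b = layers[-1]
    return W @ a + b                              # S0 = W_k a + b_k
```

With the layer sizes given later in the talk, the weight shapes would chain 2500 -> 1600 -> 900 -> 400 -> 136.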
Our Method: Local SAN
Initialize the shape $S_0$ from the global SAN, then predict the shape deviation with an AE, refining the shape with local features. Here $\phi(S_0)$ denotes shape-indexed local features around $S_0$ (PCA of concatenated SIFT features):
$$H_1^* = \arg\min_{H_1} \left\| \Delta S_1 - h_k^1(\dots h_1^1(\phi(S_0))) \right\|_2^2 + \alpha \sum_{i=1}^{k} \| W_i^1 \|_F^2,\quad \Delta S_1 = S - S_0$$
The refined shape is $S_1 = S_0 + \Delta S_1$.
Our Method: Coarse-to-Fine Cascade ($S_0 \to S_1 \to S_2 \to S_3$)
$$H_j^* = \arg\min_{H_j} \left\| \Delta S_j - h_k^j(\dots h_1^j(\phi(S_{j-1}))) \right\|_2^2 + \alpha \sum_{i=1}^{k} \| W_i^j \|_F^2$$
where $j$ indexes the local SAN and $k$ indexes the hidden layer. Early SANs use a larger search region/step; later SANs use a smaller search region/step.
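Putting the pieces together, the whole cascade at test time is a loop of "extract shape-indexed features, predict a deviation, update the shape". A structural sketch with stand-in callables for the trained models:

```python
import numpy as np

def cfan_align(I, global_san, local_sans, shape_features):
    """Coarse-to-fine cascade: the global SAN gives S0; each local
    SAN #j predicts Delta S_j from phi(S_{j-1}) and updates
    S_j = S_{j-1} + Delta S_j."""
    S = global_san(I)                   # coarse initial shape S0
    for H_j in local_sans:              # progressively finer SANs
        phi = shape_features(I, S)      # e.g. PCA of SIFT at landmarks
        S = S + H_j(phi)                # additive refinement
    return S
```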
Details
Global SAN:
- Input: 50x50 image, vectorized to 2500 dims
- 3 hidden layers: 1600, 900, 400
Local SANs:
- Face resolutions: 80x80, 140x140, 140x140
- Shape-indexed SIFT features: 128x68 = 8704 dims, reduced by PCA to 1695, 2418, and 2440 respectively for the 3 SANs
- Output: 136-D shape deviation
- 3 hidden layers: 1296, 784, 400
Experiments
Datasets for evaluation:
- XM2VTS [Messer'99]: test on 2360 face images; training on 3478 images (LFPW training set, Helen, AFW)
- LFPW [Belhumeur'11]: test on 300 images collected in the wild; training on 3478 images (LFPW training set, Helen, AFW)
- HELEN [Le'12]: test on 330 images in the wild; training on 3148 images (LFPW and Helen training sets, AFW)
- AFW [Zhu'12]: 205 images with 468 faces collected in the wild
Experiments
Evaluation of the different SANs, conducted on LFPW.
Run time per stage: global SAN 0.25 ms; local SAN 1: 7.63 ms; local SAN 2: 7.28 ms; local SAN 3: 7.68 ms.
[Figure: cumulative error curves (data proportion vs. NRMSE) showing the performance gain of each SAN: mean shape, global SAN, local SAN 1, local SAN 2, local SAN 3.]
Experiments (3/8)
Comparative methods:
- Local models with regression fitting: SDM [Xiong'13], DRMF [Asthana'13]
- Tree-structured models: Zhu et al. [Zhu'12], Yu et al. [Yu'13]
- Deep model: DCNN [Sun'13]
Experimental Result (4/8)
Performance comparisons on HELEN.
[Figure: cumulative error curves (data proportion vs. NRMSE) for Zhu et al., Yu et al., DRMF, SDM, and our method.]
Experimental Result (5/8)
Performance comparisons on LFPW.
[Figure: cumulative error curves (data proportion vs. NRMSE) for Zhu et al., Yu et al., DRMF, SDM, and our method.]
Experimental Result (6/8)
Performance comparisons on XM2VTS.
[Figure: cumulative error curves (data proportion vs. NRMSE, 0.03-0.08) for Zhu et al., Yu et al., DRMF, SDM, and our method.]
Experimental Result (7/8)
Comparisons with DCNN* [Sun et al., CVPR'13]. Note: performance is evaluated in terms of five common landmarks.
[Figure: comparison panels on XM2VTS, LFPW, and HELEN.]
Experimental Result (8/8)
[Figure: example alignment results under pose, expression, beard, sunglasses, and occlusion.]
CFAN Summary
- The global SAN achieves a more accurate initialization.
- The SAE characterizes well the non-linearity from appearance to face shape.
- The coarse-to-fine strategy is effective and alleviates the local-minimum problem.
- Impressive improvement, with real-time performance.
J. Zhang, S. Shan, M. Kan, X. Chen. Coarse-to-Fine Auto-Encoder Networks (CFAN)
for Real-Time Face Alignment. ECCV2014 (oral)
Outline
- Background: a brief history of face recognition in terms of benchmark evolution
- Feature learning with CNN (trained with "big" data)
  - For the ACM ICMI EmotiW 2014 challenge
  - For the IEEE FG2015 PaSC video-based FR challenge
- Auto-encoder for Face X (with "small" data)
  - DAE for face alignment (ECCV14)
  - DAE for face normalization (CVPR14)
  - DAE for face super-resolution (ECCV14)
  - DAE for face detection (under review)
  - DAE for cross-domain face recognition (under review)
- Summary and discussion
M. Kan, S. Shan, H. Chang, X. Chen. Stacked Progressive Auto-Encoder
(SPAE) for Face Recognition Across Poses. CVPR2014
Problem and Existing Solutions
Face recognition across pose.
Challenge: the appearance difference caused by pose can be even larger than that due to identity.
Existing solutions:
- Pose-invariant feature representations
- Virtual images at the target pose: geometry-based (implicit/explicit 3D recovery) or learning-based (in 2D)
Regression-based Methods
Predict the view at one pose from another: a non-linear transform, approximated by globally linear regression.
[Figure: learning and prediction with mappings between pose-specific appearances A_P, using bases Phi_0 and Phi_P.]
X. Chai, S. Shan, X. Chen and W. Gao. Locally linear regression for pose-invariant face recognition. IEEE T IP (2007).
Regression-based Methods
Predict the view at one pose from another: from globally linear regression to locally linear regression.
X. Chai, S. Shan, X. Chen and W. Gao. Locally linear regression for pose-invariant face recognition. IEEE T IP (2007).
Motivation
How about a deep model directly? A stacked de-noising auto-encoder that regards the non-frontal view as a contaminated version of the frontal view. Unfortunately, this fails again: the complex non-linear model easily overfits to "small" data.
Our idea: exploit priors. Pose changes smoothly, so use a stage-wise non-linearity that progressively reaches the final goal.
[Figure: stacked auto-encoder with encoders f1, f2, f3 and decoders g3, g2, g1 between the input and output layers.]
Our Method
Basic idea: stack multiple progressive single-layer auto-encoders; each layer maps non-frontal faces to faces at a smaller pose:
[-45°, +45°] → [-30°, +30°] → [-15°, +15°] → [0°]
[Figure: stacked network with encoders f1, f2, f3 and decoders g1, g2, g3, narrowing the pose range layer by layer.]
Our Method
Basic idea, taking layer #1 as an example. The first layer maps poses into [-30°, +30°]:
p(x_output) = 30°, if p(x_input) >= 30°
p(x_output) = p(x_input), if p(x_input) < 30°
No pose estimation is needed for testing.
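The per-layer target rule above amounts to clamping the pose into the layer's range. A sketch (the slide states only the non-negative case; the symmetric clamp for negative poses is an assumption):

```python
def pose_target(pose_in, bound):
    """Target pose for one SPAE layer: poses outside [-bound, bound]
    are pulled to the boundary; poses already inside are kept, so
    each layer only narrows the pose range a little."""
    if pose_in >= bound:
        return bound
    if pose_in <= -bound:   # assumed symmetric counterpart
        return -bound
    return pose_in
```

Chaining the layer schedule 30 -> 15 -> 0 takes a 45-degree face to frontal in three steps, and a test image never needs its pose estimated: it simply passes through every layer.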
Our Method
Discussion: the intermediate goals restrict the model and thus alleviate overfitting. A multi-view database provides these intermediate goals; otherwise, there are too many feasible solutions.
[Figure: input non-frontal face image mapped to an output virtual frontal view.]
Our Method
Discussion: intermediate goals restrict the model, thus alleviating overfitting with small data. A multi-view database provides the intermediate goals; otherwise, there are too many feasible solutions.
Cons: not general enough, since it places special needs on the training set: face images taken under multiple poses (CMU PIE or Multi-PIE works for this purpose).
Pros: big data is not needed, and the pose need not be known.
Our Method
- Step 1: optimize each single-layer progressive AE.
- Step 2: fine-tune the stacked deep network.
- Step 3: output the few topmost hidden layers as pose-robust features.
- Step 4: supervised feature extraction via Fisher Linear Discriminant analysis (FLD).
Optimization method: CG (conjugate gradient).
Experimental Results
Experimental Results
Experiments on Multi-PIE:
- Poses: [-45°, -30°, -15°, 0°, +15°, +30°, +45°]
- 200 subjects (7 ipp) for training (4207 images); 137 subjects (7 ipp) for testing (no overlap)
- Gallery: frontal; probes: images at the other poses
- 5000 neurons (all layers)
Experiments on FERET:
- Poses: [-60°, -45°, -30°, -15°, 0°, +15°, +30°, +45°, +60°]
- 100 subjects (9 ipp) for training; 100 subjects (9 ipp) for testing (no overlap)
- Gallery: frontal; probes: images at the other poses
Experimental Results
Comparison on Multi-PIE
Comparison on FERET
SPAE Summary
- SPAE performs better than other 2D methods, and is comparable to 3D ones
- SPAE narrows down pose variations layer by layer, along the pose-variation manifold
- SPAE needs no pose estimation for the test image
- Prior domain knowledge does help the design of deep networks
Outline
- Background: a brief history of face recognition in terms of benchmark evolution
- Feature learning with CNN (trained with "big" data)
  - For the ACM ICMI EmotiW 2014 challenge
  - For the IEEE FG2015 PaSC video-based FR challenge
- Auto-encoder for Face X (with "small" data)
  - DAE for face alignment (ECCV14)
  - DAE for face normalization (CVPR14)
  - DAE for image super-resolution (ECCV14)
  - DAE for face detection (under review)
  - DAE for cross-domain face recognition (under review)
- Summary and discussion
Zhen Cui, Hong Chang, Shiguang Shan, Bineng Zhong, and Xilin Chen. Deep Network Cascade for Image Super-resolution. ECCV 2014.
Problem to Solve
Our Method
Deep Network Cascade (DNC): layer-wise non-linear stacking (NLSS + CLA)
- NLSS: non-local self-similarity for patch reconstruction
- CLA: Collaborative Local Auto-encoder, which jointly denoises the reconstructed patches via AE
Our Method
One layer of DNC:
- NLSS: reconstruct each patch with its K nearest neighbors
- CLA: a modified AE with sparse constraints, plus a compatibility constraint on overlapping patches
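The NLSS step can be sketched as follows (a simplification for illustration: each patch is replaced by the plain mean of its K most similar patches, whereas the paper solves a constrained reconstruction over those neighbors):

```python
import numpy as np

def nlss_reconstruct(patches, K=5):
    """Non-local self-similarity (NLSS) sketch: reconstruct each
    patch from its K nearest neighbor patches of the same image."""
    P = np.asarray(patches, dtype=float)
    # pairwise squared distances between all patches
    d2 = ((P[:, None, :] - P[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)        # a patch is not its own neighbor
    out = np.empty_like(P)
    for i in range(len(P)):
        nn = np.argsort(d2[i])[:K]      # indices of K nearest patches
        out[i] = P[nn].mean(axis=0)     # simple average reconstruction
    return out
```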
Results
Comparisons with previous methods in terms of PSNR/SSIM
[15] Kim, et al. Single-image super-resolution using sparse regression and natural image prior. T PAMI 2010
[22] Lu, et al. Geometry constrained sparse coding for single image super-resolution. CVPR 2012
[33] Yang et al. Image super-resolution as sparse representation of raw image patches. CVPR 2008
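For reference, PSNR, one of the two metrics used in the comparison, can be computed as:

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference image
    and a reconstructed one (peak = maximum pixel value)."""
    mse = np.mean((np.asarray(ref, float) - np.asarray(test, float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```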
Example Results
Layer-by-layer progressive resolution increase
Outline
- Background: a brief history of face recognition in terms of benchmark evolution
- Feature learning with CNN (trained with "big" data)
  - For the ACM ICMI EmotiW 2014 challenge
  - For the IEEE FG2015 PaSC video-based FR challenge
- Auto-encoder for Face X (with "small" data)
  - DAE for face alignment (ECCV14)
  - DAE for face normalization (CVPR14)
  - DAE for image super-resolution (ECCV14)
  - DAE for face detection (under review)
  - DAE for cross-domain face recognition (under review)
- Summary and discussion
Enhance Face Detection via AE
- Use AEs after AdaBoost
- Classify the "hard" face/non-face candidate windows
- Handle multi-view candidates together
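The two-stage idea above can be sketched as follows (the scorers, thresholds, and function names here are hypothetical, introduced only to illustrate routing easy windows through the boosted stage and hard ones through the AE):

```python
def detect(windows, boost_score, ae_score, t_lo=-1.0, t_hi=1.0):
    """AdaBoost handles the easy decisions; only windows whose
    boosted score falls in the ambiguous band [t_lo, t_hi] are
    passed to the AE-based classifier."""
    faces = []
    for w in windows:
        s = boost_score(w)
        if s < t_lo:
            continue                      # confidently non-face: reject
        if s > t_hi or ae_score(w) > 0:   # easy face, or AE confirms a hard case
            faces.append(w)
    return faces
```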
Enhance Face Detection via AE
Comparison on FDDB
Outline
- Background: a brief history of face recognition in terms of benchmark evolution
- Feature learning with CNN (trained with "big" data)
  - For the ACM ICMI EmotiW 2014 challenge
  - For the IEEE FG2015 PaSC video-based FR challenge
- Auto-encoder for Face X (with "small" data)
  - DAE for face alignment (ECCV14)
  - DAE for face normalization (CVPR14)
  - DAE for image super-resolution (ECCV14)
  - DAE for face detection (under review)
  - DAE for cross-domain face recognition (under review)
- Summary and discussion
Problem to Solve
Unsupervised domain adaptation
- Example: source domain: visible-light face images; target domain: near-infrared face images
Basic idea
- Represent source-domain samples via non-linear combinations of target-domain samples
Our Method
Bi-shifting Auto-Encoder: reduces domain discrepancy
- Network structure
  - Common encoder fc
  - Separate decoders for each domain: gs for the source domain, gt for the target domain
- Shifted source domain
  - "Targetized" (an avatar in the target domain, i.e., a virtual target sample)
  - Labels are preserved (for subsequent supervised learning)
Our Method
Objective function
- Self-reconstruction of both domains (AE)
- Target samples are sparsely represented by source samples, and vice versa
Optimization
- Alternating optimization of fc, gs, gt and the sparse coefficients Bs, Bt
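One way to write such an objective down (a schematic reconstruction from the bullet points only, not the exact formulation of the paper under review; here $X_s, X_t$ are source/target sample matrices, $B_s, B_t$ are the sparse coefficient matrices, and the weights $\lambda, \gamma$ are introduced for illustration):

```latex
\min_{f_c,\, g_s,\, g_t,\, B_s,\, B_t}\;
  \underbrace{\lVert X_s - g_s(f_c(X_s))\rVert_F^2
            + \lVert X_t - g_t(f_c(X_t))\rVert_F^2}_{\text{self-reconstruction of both domains}}
\; + \; \lambda \underbrace{\Bigl( \lVert f_c(X_t) - f_c(X_s)\,B_t\rVert_F^2
            + \lVert f_c(X_s) - f_c(X_t)\,B_s\rVert_F^2 \Bigr)}_{\text{cross-domain sparse representation}}
\; + \; \gamma \bigl( \lVert B_t\rVert_1 + \lVert B_s\rVert_1 \bigr)
```

Alternating optimization then fixes $B_s, B_t$ while updating the network parameters $f_c, g_s, g_t$, and vice versa.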
Results
Evaluation of domain adaptation across ethnicity
Results
Evaluation of domain adaptation across imaging sensors
Outline
- Background: a brief history of face recognition in terms of benchmark evolution
- Feature learning with CNN (trained with "big" data)
  - For the ACM ICMI EmotiW 2014 challenge
  - For the IEEE FG2015 PaSC video-based FR challenge
- Auto-encoder for Face X (with "small" data)
  - DAE for face alignment (ECCV14)
  - DAE for face normalization (CVPR14)
  - DAE for image super-resolution (ECCV14)
  - DAE for face detection (under review)
  - DAE for cross-domain face recognition (under review)
- Summary and discussion
Summary and discussion
- DL (esp. CNN) wins with "big" data
  - So, collect big data…
  - The deeper, the better (?)
- No ability to collect big data? Or big data is impossible?
  - DAE works for nonlinear transforms
  - Past experience helps to build the model
  - Data structure helps to design the network
  - Priors help to design the objective functions
Collaborators
Xilin Chen, Ruiping Wang, Meina Kan, Jie Zhang, Mengyi Liu, Zhiwu Huang, Shaoxin Li, Hong Chang, Zhen Cui