Deep Models for Face Processing - ShanghaiTech SSDS 2015 (transcript of tutorial slides, shanghaitechssds2015.shanghaitech.edu.cn/slides/tutorial...)
Deep Models for Face Processing
with “Big” or “Small” Data
Shiguang Shan
Institute of Computing Technology, Chinese
Academy of Sciences
Outline
- Background: a brief history of face recognition in terms of benchmark evolution
- Feature learning with CNN (trained with "big" data)
  - For the ACM ICMI EmotiW 2014 challenge
  - For the IEEE FG2015 PaSC video-based FR challenge
- Auto-encoder for Face X (with "small" data)
  - DAE for face alignment (ECCV14)
  - DAE for face normalization (CVPR14)
  - DAE for face super-resolution (ECCV14)
  - DAE for face detection (under review)
  - DAE for cross-domain face recognition (under review)
- Summary and discussion
A Brief History of FR
Academic milestones: ORL, Extended Yale B, AR: 1990~ (<130 subjects)
- Recognition rates: 95%~99% [J. Wright et al., 2008]
- Typical methods: linear models (PCA, LDA, SRC)
- ORL (40 subjects, 10 ipp); AR (126 subjects, 26 ipp)
A Brief History of FR
Academic milestones:
- ORL, Extended Yale B, AR: 1990~ (<130 subjects)
- FERET: 1994~2010 (1196 subjects, 2~5 ipp); recognition rates: 99%~94% (Dup.I & II) [S. Xie, S. Shan, X. Chen, IEEE T IP 2010]
A Brief History of FR
Academic milestones:
- ORL, Extended Yale B, AR: 1990~ (<130 subjects)
- FERET: 1994~2010 (1196 subjects, 2~5 ipp); recognition rates: 99%~94% (Dup.I & II) [S. Xie, S. Shan, X. Chen, IEEE T IP 2010]
- Methods: local Gabor magnitude + local Gabor phase + block-based LDA
[Figure: Gabor + BFLD pipeline; per-block scores S1, S2, ..., SM are summed into a final score S.]
A Brief History of FR—FERET
1. Claudio A. Perez, Leonardo A. Cament, Luis E. Castillo. Methodological improvement on local Gabor face recognition based on feature selection and enhanced Borda count. Pattern Recognition 44 (2011), 951-963.
2. Georgios Tzimiropoulos, Stefanos Zafeiriou, Maja Pantic. Subspace Learning from Image Gradient Orientations. IEEE T PAMI, 2012.
3. Hieu V. Nguyen, Li Bai, Linlin Shen. Local Gabor Binary Pattern Whitened PCA: A Novel Approach for Face Recognition from Single Image Per Person. ICB 2009, LNCS 5558, pp. 269-278, 2009.
4. Andrew Wagner, John Wright, Arvind Ganesh, Zihan Zhou, Hossein Mobahi, Yi Ma. Toward a Practical Face Recognition System: Robust Alignment and Illumination by Sparse Representation. IEEE T PAMI, 2012.
5. Ngoc-Son Vu, Alice Caplier. Face Recognition with Patterns of Oriented Edge Magnitudes. ECCV 2010.
6. Timo Ahonen, Abdenour Hadid, Matti Pietikäinen. Face Recognition with Local Binary Patterns. ECCV 2004.

Comparative methods on the FERET probe sets (released by NIST):

Method                                       FB     FC     Dup.I  Dup.II
Our method [T IP10]                          99%    100%   94%    93%
[1] LGP + Borda Count (PR11)                 99.8%  99.5%  89.2%  86.8%
[2] Image Gradient Orientations (T PAMI12)   -      -      88.9%  85.4%
[3] LGBP + Whitened PCA (ICB09)              98.1%  98.9%  83.8%  81.6%
[5] Oriented Edge Magnitudes (ECCV10)        98.1%  99%    79%    79.1%
[4] Improved SRC (T PAMI12)                  96.6%  58.8%  71.6%  61.5%
[6] LBP (ECCV04)                             97%    79%    66%    64%
A Brief History of FR—FRGC
Academic milestones:
- ORL, Extended Yale B, AR: 1990~ (<130 subjects)
- FERET: 1994~2010 (1196 subjects, 2~5 ipp)
- FRGC v2.0: 2004~2012 (~500 subjects, ~50 ipp); VR=96% @ FAR=0.1% [Y. Li, S. Shan, H. Zhang, S. Lao, X. Chen, ACCV12]
A Brief History of FR—FRGC
Academic milestones:
- ORL, Extended Yale B, AR: 1990~ (<130 subjects)
- FERET: 1994~2010 (1196 subjects, 2~5 ipp)
- FRGC v2.0: 2004~2012 (~500 subjects, ~50 ipp); VR=96% @ FAR=0.1% [Y. Li, S. Shan, H. Zhang, S. Lao, X. Chen, ACCV12]
- Method: local Gabor magnitude + LPQ + block-LDA
A Brief History of FR—FRGC
Verification rates on the FRGC test set (at FAR=0.1%):

Method                                Exp.1  Exp.4
FRGC baseline (Eigenfaces)            66%    12%
Hybrid Fourier [Hwang 2006]           91%    74%
KFA [Liu 2006]                        92%    76%
DCT_EFM [Liu 2008]                    n/a    84%
Gabor + LDA [Han 2010]                97%    78%
LBP & Gabor + KLDA + SN [Tan 2010]    n/a    88%
Our method [Su 2009]                  98%    89%
RTF + RCF [Deng 2013]                 99%    93.5%
Our method [Li 2012]                  99%    96%

[Hwang 06] W. Hwang, et al. Multiple Face Model of Hybrid Fourier Feature for Large Face Image Set. CVPR 2006.
[Liu 06] C. Liu. Capitalize on dimensionality increasing techniques for improving face recognition performance. IEEE T PAMI, 2006.
[Liu 08] Z. Liu, C. Liu. Fusion of the complementary Discrete Cosine Features in the YIQ color space for face recognition. CVIU, 2008.
[Han 10] Z. Han, C. Fang, X. Ding. A Discriminated Correlation Classifier for Face Recognition. Proc. of 2010 ACM Symposium on Applied Computing, 2010.
[Tan 10] X. Tan, B. Triggs. Enhanced Local Texture Feature Sets for Face Recognition Under Difficult Lighting Conditions. IEEE T IP 19(6), 2010.
[Deng 13] W. Deng, J. Hu, J. Guo, W. Cai, D. Feng. Emulating biological strategies for uncontrolled face recognition. Pattern Recognition, 2013.
[Li 12] Y. Li, S. Shan, H. Zhang, S. Lao, X. Chen. Fusing Magnitude and Phase Features for Robust Face Recognition. ACCV 2012.
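The verification-rate-at-fixed-FAR metric used in this table (and in the following slides) can be computed from genuine and impostor similarity scores. A minimal sketch, assuming higher scores mean more similar; thresholding conventions vary slightly between benchmarks, so treat this as illustrative:

```python
import numpy as np

def vr_at_far(genuine, impostor, far=0.001):
    """Verification rate at a fixed false accept rate.

    The threshold is chosen so that the requested fraction of
    impostor scores is (wrongly) accepted; VR is then the fraction
    of genuine scores above that threshold."""
    imp = np.sort(np.asarray(impostor, dtype=float))[::-1]  # descending
    k = max(int(far * len(imp)), 1)   # number of accepted impostors
    threshold = imp[k - 1]
    return float(np.mean(np.asarray(genuine, dtype=float) > threshold))
```

For example, at FAR=0.1% with 100,000 impostor pairs, the threshold is placed at the 100th highest impostor score.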
A Brief History of FR—MBE2010
Academic milestones:
- ORL, Extended Yale B, AR: 1990~ (<130 subjects)
- FERET: 1994~2010 (1196 subjects, 2~5 ipp)
- FRGC v2.0: 2004~2012 (~500 subjects, ~50 ipp)
- NIST MBE 2010 (1.6M subjects, ~2 ipp); scenario: ID photo vs. ID photo; face identification 1:N (closed set) with N = 1.6 million
A Brief History of FR—China Test
Academic milestones:
- ORL, Extended Yale B, AR: 1990~ (<130 subjects)
- FERET: 1994~2010 (1196 subjects, 2~5 ipp)
- FRGC v2.0: 2004~2012 (~500 subjects, ~50 ipp)
- NIST MBE 2010 (1.6M subjects, ~2 ipp)
- Test on China 2nd-generation ID card photos, 2010:
  - 1:N (10M faces): ~90% (Sagem solutions)
  - 1:N (2.7M faces): ~92% (ours, 100K probes)
A Brief History of FR—LFW
Academic milestones:
- ORL, Extended Yale B, AR: 1990~ (<130 subjects)
- FERET: 1994~2010 (1196 subjects, 2~5 ipp)
- FRGC v2.0: 2004~2012 (~500 subjects, ~50 ipp)
- NIST MBE 2010 (1.6M subjects, ~2 ipp)
- Test on China 2nd-generation ID card photos, 2010
- LFW: since 2007 (~5749 celebrities, 1680 with >2 ipp)
  - 95.17% [D. Chen, X. Cao, F. Wen, J. Sun, CVPR13]
  - 97.35% [Y. Taigman, M. Yang, M. Ranzato, L. Wolf, CVPR14]
  - 97.45% [Y. Sun, X. Wang, X. Tang, CVPR14]
  - >99.5% [DeepID3, Face++, Tencent, insky.so, ...]
  - 99.63% [FaceNet; F. Schroff, D. Kalenichenko, J. Philbin, CVPR15]
A Brief History of FR—LFW
Academic milestones:
- ORL, Extended Yale B, AR: 1990~ (<130 subjects)
- FERET: 1994~2010 (1196 subjects, 2~5 ipp)
- FRGC v2.0: 2004~2012 (~500 subjects, ~50 ipp)
- NIST MBE 2010 (1.6M subjects, ~2 ipp)
- Test on China 2nd-generation ID card photos, 2010
- LFW: since 2007 (~5749 celebrities, 1680 with >2 ipp)
  - 95.17% [D. Chen, X. Cao, F. Wen, J. Sun, CVPR13] (method: high-dimensional LBP + Joint Bayesian)
  - 97.35% [Y. Taigman, M. Yang, M. Ranzato, L. Wolf, CVPR14]
  - 97.45% [Y. Sun, X. Wang, X. Tang, CVPR14]
  - >99.5% [DeepID3, Face++, Tencent, insky.so, ...]
  - 99.63% [FaceNet; F. Schroff, D. Kalenichenko, J. Philbin, CVPR15]
  - Method behind the latter results: deep learning
More About LFW Evaluation
Labeled Faces in the Wild (LFW):
- Face verification (1:1) on celebrity faces; photos from Yahoo! News
- Evaluation protocol: unrestricted training set; test set of 6000 image pairs, half of the same person and half from different persons
Huang G B, Ramesh M, Berg T, et al. Labeled faces in the wild: A database for
studying face recognition in unconstrained environments. Technical Report,
University of Massachusetts, Amherst, 2007.
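Under this protocol, reported accuracy is the fraction of the 6000 pairs on which a thresholded similarity score agrees with the same/different label. A sketch of the computation with hypothetical scores (the 10-fold threshold selection used in practice is simplified here to a single split):

```python
import numpy as np

def verification_accuracy(scores, labels, threshold):
    """A pair is declared 'same person' when its similarity score
    exceeds the threshold; accuracy is the fraction classified right."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    return float(np.mean((scores > threshold) == labels))

def best_threshold(scores, labels):
    """Pick the candidate threshold maximizing accuracy, as one
    would on held-out folds."""
    candidates = np.unique(scores)
    accs = [verification_accuracy(scores, labels, t) for t in candidates]
    return float(candidates[int(np.argmax(accs))])
```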
More About LFW Evaluation
- 2014: DeepFace [1] (Facebook); training: 4K subjects, 4.4M images
- 2015: DeepID2+ [2]; training: 10K celebrities, 202K images
[1] Taigman Y, Yang M, Ranzato M A, et al. Deepface: Closing the gap to human-
level performance in face verification. CVPR, 2014.
[2] Sun Y, Wang X, Tang X. Deeply learned face representations are sparse,
selective, and robust. arXiv preprint, 2014.
Is Face Recognition Solved?
No! There are many distinct scenarios: some are almost solved, some far from solved.
Scenarios almost solved:
- Face verification in controlled environments with cooperative users: time attendance, access control (low security requirements), student verification in exams
- Duplicate identity checking based on face photos: MPS duplicate passport checking (0.2 billion faces)
- VIP watch-list screening: banks, shops, stores...
- Celebrity face retrieval
Is Face Recognition Solved?
No! There are many distinct scenarios: some are almost solved, some far from solved.
Scenarios almost solved:
- Face verification in controlled environments with cooperative users: time attendance, access control (low security requirements), student verification in exams. Unsolved: twins, major plastic surgery
- Duplicate identity checking based on face photos: MPS duplicate passport checking (0.2 billion faces). Unsolved: naturally similar faces
- Whitelist (e.g. VIP) screening: banks, shops, stores... (half a loaf is better than no bread)
- Celebrity face retrieval: recall rate is not seriously considered...
Is Face Recognition Solved?
No! There are many distinct scenarios: some are almost solved, some far from solved.
Scenarios not solved:
- Face verification with very high security demands: payment by face, access control...
- Face verification against ID photos, e.g. based on China 2nd-generation ID card photos
- Blacklist screening for video surveillance
Is Face Recognition Solved?
No! There are many distinct scenarios: some are almost solved, some far from solved.
Scenarios not solved:
- Face verification with very high security demands (payment by face, access control...): not convenient enough, as the false reject rate is >30% @ FAR=0.01%; anti-spoofing is hard (photo, video, synthesized video...)
- Face verification against ID photos (e.g. China 2nd-generation ID card photos): false reject rate >30% @ FAR=0.1% with large photos, >50% @ FAR=0.1% with on-chip photos
- Blacklist screening for video surveillance: recognition rate <30% @ FAR=0.01%; surveillance videos are low quality, and large-scale training/testing data for this kind of scenario is lacking
Advertisement: A New Database
COX video face database: http://vipl.ict.ac.cn/resources/datasets/cox-face-dataset
Features of COX:
- 1000 subjects, each with 1 high-quality still image and 3 low-quality video clips from 3 camcorders
- (Aims to) simulate video surveillance
- Evaluation protocols included
Advertisement: A New Database
COX: still image vs. video clips; verification rate <40% @ FAR=0.1%
Outline
- Background: a brief history of face recognition in terms of benchmark evolution
- Feature learning with CNN (trained with "big" data)
  - For the ACM ICMI EmotiW 2014 challenge
  - For the IEEE FG2015 PaSC video-based FR challenge
- Auto-encoder for Face X (with "small" data)
  - DAE for face alignment (ECCV14)
  - DAE for face normalization (CVPR14)
  - DAE for face super-resolution (ECCV14)
  - DAE for face detection (under review)
  - DAE for cross-domain face recognition (under review)
- Summary and discussion
M. Liu, R. Wang, S. Li, Z.Huang, S.Shan, X. Chen. Combining Multiple Kernel Methods
on Riemannian Manifold for Emotion Recognition in the Wild. ACM ICMI 2014
Feature Learning with CNN
CNN can learn features with surprising results, provided you have big data!
Feature engineering vs. feature learning:
- Manually designed filters vs. learned filters
- The former yield low-level features (e.g. gradients...); the latter yield mid- and high-level features that are increasingly explanatory and abstract, closer to semantics
- Features are learned with a task-specific objective, yet are sharable across tasks and easy to transfer
Two examples from our practice follow.
EmotiW 2014: Task
Task: classify an audio-video clip into one of seven categories: neutral, anger, disgust, fear, happy, sad, surprise.
Challenge: close-to-real-world conditions with large variations, e.g. head pose, illumination, partial occlusion, etc.
EmotiW 2014: Data
Challenging data: the AFEW* 4.0 database, audio-video clips collected from movies showing close-to-real-world conditions.

Attribute of AFEW 4.0    Description
Length of sequences      300-5400 ms
Number of annotators     3
Emotion categories       Anger, disgust, fear, happiness, neutral, sadness, and surprise
Audio/video format       Audio: WAV; video: AVI
# of samples             1368
# of subjects            428
# of movies              111

*Acted Facial Expressions in the Wild
EmotiW 2014: Protocols
Evaluation protocols: the dataset is divided into training, validation, and testing sets; the test labels were unknown; either the audio or video modality, or both, can be used.

Set    # subjects  Min. age  Max. age  Avg. age  # males  # females
Train  177         5         76        34        102      75
Val    136         10        70        35        78       58
Test   115         5         88        34        64       51

Set    Anger  Disgust  Fear  Happiness  Neutral  Sadness  Surprise
Train  92     66       66    105        102      82       54
Val    59     39       44    63         61       59       46
Test   58     26       46    81         117      53       26
Our Method
Stage 1: emotion video representation. Image features are extracted on aligned faces (dense SIFT, HOG, DCNN), and each video (image set) is modeled as a linear subspace, a covariance matrix, or a Gaussian distribution.
Stage 2: emotion video recognition. Classification on the Riemannian manifold via kernel SVM/LR/PLS, followed by score-level fusion.
M. Liu, R. Wang, S. Li, Z.Huang, S.Shan, X. Chen. Combining Multiple Kernel Methods
on Riemannian Manifold for Emotion Recognition in the Wild. ACM ICMI 2014
Our Method
Image features: aligned face images of 64x64; features: HOG, dense SIFT, DCNN.
- DCNN: CaffeNet trained on the CFW database, over 150,000 face images from 1520 subjects; identities serve as the supervision labels. Architecture: 3@237x237 > 96@57x57 > 96@28x28 > 256@28x28 > 384@14x14 > 256@14x14 > 256@7x7 > 4096 > 1520. The output of the last convolutional layer is used as the final image feature: 256x7x7 = 12,544 dims.
- HOG: block size 16x16; stride 8; # of blocks: 7x7 = 49; # of cells per block: 2x2; # of bins: 9; total dims: 2x2x9x49 = 1764.
- Dense SIFT: block size 16x16; stride 8; # of points: 7x7 = 49; dims per point: 4x4x8 = 128; total dims: 128x49 = 6272.
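The dimensionalities above follow directly from the block/cell layout; a quick sanity-check sketch (not the actual extraction code):

```python
def hog_dims(cells_per_block=(2, 2), bins=9, blocks=7 * 7):
    # 2x2 cells x 9 bins x 49 blocks = 1764
    return cells_per_block[0] * cells_per_block[1] * bins * blocks

def dense_sift_dims(dims_per_point=4 * 4 * 8, points=7 * 7):
    # 128 dims per keypoint x 49 points = 6272
    return dims_per_point * points

def dcnn_dims(channels=256, height=7, width=7):
    # flattened output of the last convolutional layer = 12544
    return channels * height * width
```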
Our Results
Combining multiple features (accuracy, %):

Method                                               Validation set  Test set
Baseline (provided by EmotiW organizers)             34.40           33.70
Audio (openSMILE toolkit)                            30.73           --
Video: HOG                                           38.01           --
Video: Dense SIFT                                    43.94           --
Video: DCNN (Caffe-CFW)                              43.40           --
Video: HOG + Dense SIFT                              44.47           --
Video: HOG + Dense SIFT + DCNN (Caffe-CFW)           45.28           --
Audio + Video (HOG + Dense SIFT)                     46.36           46.68
Audio + Video (HOG + Dense SIFT + DCNN (Caffe-CFW))  48.52           50.37
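The "Audio + Video" rows come from score-level fusion. The slides do not give the fusion weights, so this is a generic weighted-average sketch with hypothetical inputs:

```python
import numpy as np

def fuse_scores(score_matrices, weights=None):
    """Score-level fusion: per-class score matrices from several
    pipelines (e.g. audio, HOG, SIFT, DCNN) are averaged, optionally
    with weights, before taking the argmax over classes."""
    stacked = np.stack([np.asarray(s, dtype=float) for s in score_matrices])
    if weights is None:
        weights = np.full(len(score_matrices), 1.0 / len(score_matrices))
    fused = np.tensordot(np.asarray(weights, dtype=float), stacked, axes=1)
    return fused.argmax(axis=1)   # predicted emotion class per clip
```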
Final Results of Competition
Outline
- Background: a brief history of face recognition in terms of benchmark evolution
- Feature learning with CNN (trained with "big" data)
  - For the ACM ICMI EmotiW 2014 challenge
  - For the IEEE FG2015 PaSC video-based FR challenge
- Auto-encoder for Face X (with "small" data)
  - DAE for face alignment (ECCV14)
  - DAE for face normalization (CVPR14)
  - DAE for face super-resolution (ECCV14)
  - DAE for face detection (under review)
  - DAE for cross-domain face recognition (under review)
- Summary and discussion
FG 2015 Video FR Challenge
Task: video-to-video face verification.
- Exp. 1 (controlled case): video-to-video verification; 1920x1080 video captured by a mounted camera
- Exp. 2 (handheld case): video-to-video verification; resolution varying from 640x480 to 1280x720; videos from a mix of different handheld point-and-shoot video cameras
FG 2015 Video FR Challenge
Videos for testing in the PaSC datasets
[Beveridge, BTAS’13]
Results in IJCB 2014
Verification rates at FAR=1% for the video-to-video (Exp. 1) and video-to-still (Exp. 2) tasks [Beveridge, IJCB'14] (handheld experiment).
Best method: Haoxiang Li, Gang Hua. Eigen Probabilistic Elastic Part (Eigen-PEP) model, CVPR13/ICCV13.
Our Method
- DCNN (single-frame feature)
- HERML (video representation and classification): Hybrid Euclidean-and-Riemannian Metric Learning [Huang, Wang, Shan, Chen, ACCV'14]
Each video is summarized by multiple statistics (mean in R^d, covariance in Sym+_d, Gaussian in Sym+_{d+1}), which live in heterogeneous spaces; a KLDA is learned for each, and the results are fused at the score level.
[Figure: frame-level DCNN [Jia'13]: four conv + pool blocks (layers 1-1 to 4-3), two further conv layers (5-1, 5-2), two fully connected layers (6-1, 6-2), and a softmax output.]
Training Models
Training the DCNN (Caffe [Jia'13]; 14 conv. layers, from 5 blocks):
- Pre-train: CFW, 153,461 images from 1520 persons; starting learning rate 0.01
- Fine-tune: PaSC training set + COX; starting learning rate 0.001
  - PaSC training set: 170 persons, 38,113 images
  - COX training set (our own, surveillance-like videos): 1000 persons, 147,737 video frames
- Features finally exploited: the 2048-dimensional features of the fc6-2 layer for each frame
Training Models
Training HERML: 1,165 videos from 470 persons, drawn from two heterogeneous datasets:
- PaSC training set: 170 persons, 265 videos
- COX training set: 300 persons, 900 videos (3 videos/person)
Final feature dimension (per video): 1320 (440x3) KLDA features.
Evaluation Results
The deeper the better.
[Figure: four DCNN architectures of increasing depth with their verification rates for single-frame features: control 41.40% / handheld 41.62%; control 46.61% / handheld 46.23%; control 47.41% / handheld 48.02%; control 54.76% / handheld 56.20%. With HERML set models on top of the DCNN: control 56.20% / handheld 54.41% and control 58.63% / handheld 59.14%.]
Primary Results
Image features: HOG < Dense SIFT << DCNN (verification rates, %):

Method  HOG (Control/Handheld)  Dense SIFT (Control/Handheld)  DCNN (Control/Handheld)
HERML   25.26 / 19.28           33.82 / 28.93                  58.63 / 59.14
Outline
- Background: a brief history of face recognition in terms of benchmark evolution
- Feature learning with CNN (trained with "big" data)
  - For the ACM ICMI EmotiW 2014 challenge
  - For the IEEE FG2015 PaSC video-based FR challenge
- Auto-encoder for Face X (with "small" data)
  - DAE for face alignment (ECCV14)
  - DAE for face normalization (CVPR14)
  - DAE for face super-resolution (ECCV14)
  - DAE for face detection (under review)
  - DAE for cross-domain face recognition (under review)
- Summary and discussion
Auto-encoder for Face X
CNN is good, but:
- It needs big data to train
- It is slow, not only in training but also in testing
- It is mainly good for feature learning
The auto-encoder, in contrast, is:
- Simple
- Fast in both training and testing
- A general non-linear transform
- Not as good for feature learning
Some example practices follow.
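To make the contrast concrete: a single auto-encoder layer is just two small matrix maps trained by gradient descent. A minimal numpy sketch, illustrative only; the models in the following slides stack several such layers and add regularization and task-specific targets:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_ae(X, hidden, lr=0.1, epochs=500):
    """One auto-encoder layer: encode x -> h = sigmoid(W1 x + b1),
    decode h -> x_hat = W2 h + b2, minimizing ||x - x_hat||^2."""
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, d)); b2 = np.zeros(d)
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)              # encoder
        err = (H @ W2 + b2) - X               # decoder residual
        dW2 = H.T @ err / n; db2 = err.mean(axis=0)
        dH = err @ W2.T * H * (1.0 - H)       # backprop through sigmoid
        dW1 = X.T @ dH / n; db1 = dH.mean(axis=0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return W1, b1, W2, b2
```

Both training and inference are a handful of matrix products, which is what makes the auto-encoder fast relative to a deep CNN.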
Problem to Solve
Face alignment: predict facial landmarks from a detected face.
Given the detected face region $I(u,v)$ and the facial landmarks $S = (x_1, y_1, x_2, y_2, \dots, x_L, y_L)$, the goal is to learn
$$S = H(I),\quad I \in \mathbb{R}^{w \times h},\quad S \in \mathbb{R}^{2L}.$$
J. Zhang, S. Shan, M. Kan, X. Chen. Coarse-to-Fine Auto-Encoder Networks
(CFAN) for Real-Time Face Alignment. ECCV2014 (oral)
Challenges
H is a complex nonlinear mapping: large appearance and shape variations due to head pose, expressions, illumination, and partial occlusion.
Related Works
- ASM & AAM [Cootes'95; Gu'08; Cootes'01; Matthews'04]: sensitive to initial shapes and to noise; hard to cover complex variations
- DCNN [Sun'13; Toshev'14]
- Shape regression models:
  - Linear regression [X. Chai, S. Shan, W. Gao, ICASSP'03]: $S = WI$
  - CPR, ESR, RCPR [Dollar'10; Cao'12; Burgos-Artizzu'13]
  - DRMF [Asthana'13]
  - SDM [Xiong'13]
Motivation
Directly apply a Stacked Auto-Encoder (SAE)? OK, but not good. Why? It easily overfits to small data: typically only thousands of images with landmark annotations are available.
Our ideas: exploit priors.
- Handcrafted features: avoid convolution (slow, needs big data...); use SIFT, shape-indexed features
- Better initialization
- Coarse to fine: piecewise non-linear
Our Method: Schema of Coarse-to-Fine Auto-Encoder Networks
[Figure: a global SAN applies a nonlinear mapping H0 to the image I, producing an initial shape S0; local SANs then refine it with nonlinear mappings over shape-indexed features: S1 = H1(phi(S0)), S2 = H2(phi(S1)), S3 = H3(phi(S2)).]
SAN: Stacked Auto-encoder Network
Our Method
[Figure: the cascade in detail. The global SAN predicts S0 from I; each local SAN #j computes shape-indexed features phi(S_{j-1}), predicts a deviation Delta S_j, and updates S_j = S_{j-1} + Delta S_j, for j = 1, 2, 3.]
Our Method: Global SAN
The global SAN learns the mapping $H_0: S \leftarrow I$ from image $I$ to shape $S$, modeled as a stacked auto-encoder (regression with regularization):
$$H_0^* = \arg\min_{H_0} \left\| S - f_k(f_{k-1}(\dots f_1(I))) \right\|_2^2 + \alpha \sum_{i=1}^{k} \| W_i \|_F^2$$
with sigmoid hidden layers and a linear output layer:
$$f_i(a_{i-1}) = \sigma(W_i a_{i-1} + b_i) \triangleq a_i,\quad i = 1, \dots, k-1$$
$$f_k(a_{k-1}) = W_k a_{k-1} + b_k \triangleq S_0$$
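In code, inference with a trained global SAN is a plain feed-forward pass: sigmoid hidden layers followed by a linear output producing the coarse shape S0. A sketch with stand-in weights (training, i.e. the minimization above, is omitted):

```python
import numpy as np

def global_san_predict(I, layers):
    """Forward pass of the global SAN.  `layers` is a list of (W, b)
    pairs; all but the last use a sigmoid (the f_i), the last is the
    linear regression layer (f_k) outputting the 2L coordinates."""
    a = np.asarray(I, dtype=float).ravel()        # vectorized face image
    for W, b in layers[:-1]:
        a = 1.0 / (1.0 + np.exp(-(W @ a + b)))    # a_i = sigma(W_i a + b_i)
    W, b = layers[-1]
    return W @ a + b                              # S0 = W_k a + b_k
```

With the layer sizes given later in the talk, the weight shapes would chain 2500 -> 1600 -> 900 -> 400 -> 136.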
Our Method: Local SAN
Initialize the shape $S_0$ from the global SAN, then predict the shape deviation with an AE, refining the shape with local features. Here $\phi(S_0)$ denotes shape-indexed local features around $S_0$ (PCA of concatenated SIFT features):
$$H_1^* = \arg\min_{H_1} \left\| \Delta S_1 - h_k^1(\dots h_1^1(\phi(S_0))) \right\|_2^2 + \alpha \sum_{i=1}^{k} \| W_i^1 \|_F^2,\quad \Delta S_1 = S - S_0$$
The refined shape is $S_1 = S_0 + \Delta S_1$.
Our Method: Coarse-to-Fine Cascade ($S_0 \to S_1 \to S_2 \to S_3$)
$$H_j^* = \arg\min_{H_j} \left\| \Delta S_j - h_k^j(\dots h_1^j(\phi(S_{j-1}))) \right\|_2^2 + \alpha \sum_{i=1}^{k} \| W_i^j \|_F^2$$
where $j$ indexes the local SAN and $k$ indexes the hidden layer. Early SANs use a larger search region/step; later SANs use a smaller search region/step.
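Putting the pieces together, the whole cascade at test time is a loop of "extract shape-indexed features, predict a deviation, update the shape". A structural sketch with stand-in callables for the trained models:

```python
import numpy as np

def cfan_align(I, global_san, local_sans, shape_features):
    """Coarse-to-fine cascade: the global SAN gives S0; each local
    SAN #j predicts Delta S_j from phi(S_{j-1}) and updates
    S_j = S_{j-1} + Delta S_j."""
    S = global_san(I)                   # coarse initial shape S0
    for H_j in local_sans:              # progressively finer SANs
        phi = shape_features(I, S)      # e.g. PCA of SIFT at landmarks
        S = S + H_j(phi)                # additive refinement
    return S
```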
Details
Global SAN:
- Input: 50x50 image, vectorized to 2500 dims
- 3 hidden layers: 1600, 900, 400
Local SANs:
- Face resolutions: 80x80, 140x140, 140x140
- Shape-indexed SIFT features: 128x68 = 8704 dims, reduced by PCA to 1695, 2418, and 2440 respectively for the 3 SANs
- Output: 136-D shape deviation
- 3 hidden layers: 1296, 784, 400
Experiments
Datasets for evaluation:
- XM2VTS [Messer'99]: test on 2360 face images; training on 3478 images (LFPW training set, Helen, AFW)
- LFPW [Belhumeur'11]: test on 300 images collected in the wild; training on 3478 images (LFPW training set, Helen, AFW)
- HELEN [Le'12]: test on 330 images in the wild; training on 3148 images (LFPW and Helen training sets, AFW)
- AFW [Zhu'12]: 205 images with 468 faces collected in the wild
Experiments
Evaluation of the different SANs, conducted on LFPW.
Run time per stage: global SAN 0.25 ms; local SAN 1: 7.63 ms; local SAN 2: 7.28 ms; local SAN 3: 7.68 ms.
[Figure: cumulative error curves (data proportion vs. NRMSE) showing the performance gain of each SAN: mean shape, global SAN, local SAN 1, local SAN 2, local SAN 3.]
Experiments (3/8)
Comparative methods:
- Local models with regression fitting: SDM [Xiong'13], DRMF [Asthana'13]
- Tree-structured models: Zhu et al. [Zhu'12], Yu et al. [Yu'13]
- Deep model: DCNN [Sun'13]
Experimental Result (4/8)
Performance comparisons on HELEN.
[Figure: cumulative error curves (data proportion vs. NRMSE) for Zhu et al., Yu et al., DRMF, SDM, and our method.]
Experimental Result (5/8)
Performance comparisons on LFPW.
[Figure: cumulative error curves (data proportion vs. NRMSE) for Zhu et al., Yu et al., DRMF, SDM, and our method.]
Experimental Result (6/8)
Performance comparisons on XM2VTS.
[Figure: cumulative error curves (data proportion vs. NRMSE, 0.03-0.08) for Zhu et al., Yu et al., DRMF, SDM, and our method.]
Experimental Result (7/8)
Comparisons with DCNN* [Sun et al., CVPR'13]. Note: performance is evaluated in terms of five common landmarks.
[Figure: comparison panels on XM2VTS, LFPW, and HELEN.]
Experimental Result (8/8)
[Figure: example alignment results under pose, expression, beard, sunglasses, and occlusion.]
CFAN Summary
- The global SAN achieves a more accurate initialization.
- The SAE characterizes well the non-linearity from appearance to face shape.
- The coarse-to-fine strategy is effective and alleviates the local-minimum problem.
- Impressive improvement, with real-time performance.
J. Zhang, S. Shan, M. Kan, X. Chen. Coarse-to-Fine Auto-Encoder Networks (CFAN)
for Real-Time Face Alignment. ECCV2014 (oral)
Outline
- Background: a brief history of face recognition in terms of benchmark evolution
- Feature learning with CNN (trained with "big" data)
  - For the ACM ICMI EmotiW 2014 challenge
  - For the IEEE FG2015 PaSC video-based FR challenge
- Auto-encoder for Face X (with "small" data)
  - DAE for face alignment (ECCV14)
  - DAE for face normalization (CVPR14)
  - DAE for face super-resolution (ECCV14)
  - DAE for face detection (under review)
  - DAE for cross-domain face recognition (under review)
- Summary and discussion
M. Kan, S. Shan, H. Chang, X. Chen. Stacked Progressive Auto-Encoder
(SPAE) for Face Recognition Across Poses. CVPR2014
Problem and Existing Solutions
Face recognition across pose.
Challenge: the appearance difference caused by pose can be even larger than that due to identity.
Existing solutions:
- Pose-invariant feature representations
- Virtual images at the target pose: geometry-based (implicit/explicit 3D recovery) or learning-based (in 2D)
Regression-based Methods
Predict the view at one pose from another: a non-linear transform, approximated by globally linear regression.
[Figure: learning and prediction with mappings between pose-specific appearances A_P, using bases Phi_0 and Phi_P.]
X. Chai, S. Shan, X. Chen and W. Gao. Locally linear regression for pose-invariant face recognition. IEEE T IP (2007).
Regression-based Methods
Predict the view at one pose from another: from globally linear regression to locally linear regression.
X. Chai, S. Shan, X. Chen and W. Gao. Locally linear regression for pose-invariant face recognition. IEEE T IP (2007).
Motivation
How about a deep model directly? A stacked de-noising auto-encoder that regards the non-frontal view as a contaminated version of the frontal view. Unfortunately, this fails again: the complex non-linear model easily overfits to "small" data.
Our idea: exploit priors. Pose changes smoothly, so use a stage-wise non-linearity that progressively reaches the final goal.
[Figure: stacked auto-encoder with encoders f1, f2, f3 and decoders g3, g2, g1 between the input and output layers.]
Our Method
Basic idea: stack multiple progressive single-layer auto-encoders; each layer maps non-frontal faces to faces at a smaller pose:
[-45°, +45°] → [-30°, +30°] → [-15°, +15°] → [0°]
[Figure: stacked network with encoders f1, f2, f3 and decoders g1, g2, g3, narrowing the pose range layer by layer.]
Our Method
Basic idea, taking layer #1 as an example. The first layer maps poses into [-30°, +30°]:
p(x_output) = 30°, if p(x_input) >= 30°
p(x_output) = p(x_input), if p(x_input) < 30°
No pose estimation is needed for testing.
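The per-layer target rule above amounts to clamping the pose into the layer's range. A sketch (the slide states only the non-negative case; the symmetric clamp for negative poses is an assumption):

```python
def pose_target(pose_in, bound):
    """Target pose for one SPAE layer: poses outside [-bound, bound]
    are pulled to the boundary; poses already inside are kept, so
    each layer only narrows the pose range a little."""
    if pose_in >= bound:
        return bound
    if pose_in <= -bound:   # assumed symmetric counterpart
        return -bound
    return pose_in
```

Chaining the layer schedule 30 -> 15 -> 0 takes a 45-degree face to frontal in three steps, and a test image never needs its pose estimated: it simply passes through every layer.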
Our Method
Discussion: the intermediate goals restrict the model and thus alleviate overfitting. A multi-view database provides these intermediate goals; otherwise, there are too many feasible solutions.
[Figure: input non-frontal face image mapped to an output virtual frontal view.]
Our Method
Discussion: intermediate goals restrict the model, thus alleviating overfitting with small data. A multi-view database provides the intermediate goals; otherwise, there are too many feasible solutions.
Cons: not general enough, since it places special needs on the training set: face images taken under multiple poses (CMU PIE or Multi-PIE works for this purpose).
Pros: big data is not needed, and the pose need not be known.
Our Method
- Step 1: optimize each single-layer progressive AE.
- Step 2: fine-tune the stacked deep network.
- Step 3: output the few topmost hidden layers as pose-robust features.
- Step 4: supervised feature extraction via Fisher Linear Discriminant analysis (FLD).
Optimization method: CG (conjugate gradient).
Experimental Results
Experimental Results
Experiments on Multi-PIE:
- Poses: [-45°, -30°, -15°, 0°, +15°, +30°, +45°]
- 200 subjects (7 ipp) for training (4207 images); 137 subjects (7 ipp) for testing (no overlap)
- Gallery: frontal; probes: images at the other poses
- 5000 neurons (all layers)
Experiments on FERET:
- Poses: [-60°, -45°, -30°, -15°, 0°, +15°, +30°, +45°, +60°]
- 100 subjects (9 ipp) for training; 100 subjects (9 ipp) for testing (no overlap)
- Gallery: frontal; probes: images at the other poses
Experimental Results
Comparison on Multi-PIE
Comparison on FERET
SPAE Summary
- SPAE performs better than other 2D methods, and is comparable to 3D ones
- SPAE narrows down pose variations layer by layer, along the pose-variation manifold
- SPAE needs no pose estimation for the test image
- Prior domain knowledge does help the design of deep networks
Outline
- Background: a brief history of face recognition in terms of benchmark evolution
- Feature learning with CNN (trained with "big" data)
  - For the ACM ICMI EmotiW 2014 challenge
  - For the IEEE FG2015 PaSC video-based FR challenge
- Auto-encoder for Face X (with "small" data)
  - DAE for face alignment (ECCV14)
  - DAE for face normalization (CVPR14)
  - DAE for image super-resolution (ECCV14)
  - DAE for face detection (under review)
  - DAE for cross-domain face recognition (under review)
- Summary and discussion
Zhen Cui, Hong Chang, Shiguang Shan, Bineng Zhong, and Xilin Chen. Deep Network Cascade for Image Super-resolution. ECCV 2014.
Problem to Solve
Our Method
Deep Network Cascade (DNC): layer-wise non-linear stacking (NLSS + CLA)
- NLSS: non-local self-similarity for patch reconstruction
- CLA: Collaborative Local Auto-encoder, which jointly denoises the reconstructed patches via AE
Our Method
One layer of DNC:
- NLSS: reconstruct each patch with its K nearest neighbors
- CLA: a modified AE with sparse constraints, plus a compatibility constraint on overlapping patches
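The NLSS step can be sketched as follows (a simplification for illustration: each patch is replaced by the plain mean of its K most similar patches, whereas the paper solves a constrained reconstruction over those neighbors):

```python
import numpy as np

def nlss_reconstruct(patches, K=5):
    """Non-local self-similarity (NLSS) sketch: reconstruct each
    patch from its K nearest neighbor patches of the same image."""
    P = np.asarray(patches, dtype=float)
    # pairwise squared distances between all patches
    d2 = ((P[:, None, :] - P[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)        # a patch is not its own neighbor
    out = np.empty_like(P)
    for i in range(len(P)):
        nn = np.argsort(d2[i])[:K]      # indices of K nearest patches
        out[i] = P[nn].mean(axis=0)     # simple average reconstruction
    return out
```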
Results
Comparisons with previous methods in terms of PSNR/SSIM
[15] Kim, et al. Single-image super-resolution using sparse regression and natural image prior. T PAMI 2010
[22] Lu, et al. Geometry constrained sparse coding for single image super-resolution. CVPR 2012
[33] Yang et al. Image super-resolution as sparse representation of raw image patches. CVPR 2008
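For reference, PSNR, one of the two metrics used in the comparison, can be computed as:

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference image
    and a reconstructed one (peak = maximum pixel value)."""
    mse = np.mean((np.asarray(ref, float) - np.asarray(test, float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```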
Example Results
Layer-by-layer progressive resolution increase
Outline
- Background: a brief history of face recognition in terms of benchmark evolution
- Feature learning with CNN (trained with "big" data)
  - For the ACM ICMI EmotiW 2014 challenge
  - For the IEEE FG2015 PaSC video-based FR challenge
- Auto-encoder for Face X (with "small" data)
  - DAE for face alignment (ECCV14)
  - DAE for face normalization (CVPR14)
  - DAE for image super-resolution (ECCV14)
  - DAE for face detection (under review)
  - DAE for cross-domain face recognition (under review)
- Summary and discussion
Enhance Face Detection via AE
- Use AEs after AdaBoost
- Classify the "hard" face/non-face candidate windows
- Handle multi-view candidates together
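The two-stage idea above can be sketched as follows (the scorers, thresholds, and function names here are hypothetical, introduced only to illustrate routing easy windows through the boosted stage and hard ones through the AE):

```python
def detect(windows, boost_score, ae_score, t_lo=-1.0, t_hi=1.0):
    """AdaBoost handles the easy decisions; only windows whose
    boosted score falls in the ambiguous band [t_lo, t_hi] are
    passed to the AE-based classifier."""
    faces = []
    for w in windows:
        s = boost_score(w)
        if s < t_lo:
            continue                      # confidently non-face: reject
        if s > t_hi or ae_score(w) > 0:   # easy face, or AE confirms a hard case
            faces.append(w)
    return faces
```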
Enhance Face Detection via AE
Comparison on FDDB
Outline
- Background: a brief history of face recognition in terms of benchmark evolution
- Feature learning with CNN (trained with "big" data)
  - For the ACM ICMI EmotiW 2014 challenge
  - For the IEEE FG2015 PaSC video-based FR challenge
- Auto-encoder for Face X (with "small" data)
  - DAE for face alignment (ECCV14)
  - DAE for face normalization (CVPR14)
  - DAE for image super-resolution (ECCV14)
  - DAE for face detection (under review)
  - DAE for cross-domain face recognition (under review)
- Summary and discussion
Problem to Solve
Unsupervised domain adaptation
- Example: source domain: visible-light face images; target domain: near-infrared face images
Basic idea
- Represent source-domain samples via non-linear combinations of target-domain samples
Our Method
Bi-shifting Auto-Encoder: reduces domain discrepancy
- Network structure
  - Common encoder fc
  - Separate decoders for each domain: gs for the source domain, gt for the target domain
- Shifted source domain
  - "Targetized" (an avatar in the target domain, i.e., a virtual target sample)
  - Labels are preserved (for subsequent supervised learning)
Our Method
Objective function
- Self-reconstruction of both domains (AE)
- Target samples are sparsely represented by source samples, and vice versa
Optimization
- Alternating optimization of fc, gs, gt and the sparse coefficients Bs, Bt
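One way to write such an objective down (a schematic reconstruction from the bullet points only, not the exact formulation of the paper under review; here $X_s, X_t$ are source/target sample matrices, $B_s, B_t$ are the sparse coefficient matrices, and the weights $\lambda, \gamma$ are introduced for illustration):

```latex
\min_{f_c,\, g_s,\, g_t,\, B_s,\, B_t}\;
  \underbrace{\lVert X_s - g_s(f_c(X_s))\rVert_F^2
            + \lVert X_t - g_t(f_c(X_t))\rVert_F^2}_{\text{self-reconstruction of both domains}}
\; + \; \lambda \underbrace{\Bigl( \lVert f_c(X_t) - f_c(X_s)\,B_t\rVert_F^2
            + \lVert f_c(X_s) - f_c(X_t)\,B_s\rVert_F^2 \Bigr)}_{\text{cross-domain sparse representation}}
\; + \; \gamma \bigl( \lVert B_t\rVert_1 + \lVert B_s\rVert_1 \bigr)
```

Alternating optimization then fixes $B_s, B_t$ while updating the network parameters $f_c, g_s, g_t$, and vice versa.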
Results
Evaluation of domain adaptation across ethnicity
Results
Evaluation of domain adaptation across imaging sensors
Outline
- Background: a brief history of face recognition in terms of benchmark evolution
- Feature learning with CNN (trained with "big" data)
  - For the ACM ICMI EmotiW 2014 challenge
  - For the IEEE FG2015 PaSC video-based FR challenge
- Auto-encoder for Face X (with "small" data)
  - DAE for face alignment (ECCV14)
  - DAE for face normalization (CVPR14)
  - DAE for image super-resolution (ECCV14)
  - DAE for face detection (under review)
  - DAE for cross-domain face recognition (under review)
- Summary and discussion
Summary and discussion
- DL (esp. CNN) wins with "big" data
  - So, collect big data…
  - The deeper, the better (?)
- No ability to collect big data? Or big data is impossible?
  - DAE works for nonlinear transforms
  - Past experience helps to build the model
  - Data structure helps to design the network
  - Priors help to design the objective functions
Collaborators
Xilin Chen, Ruiping Wang, Meina Kan, Jie Zhang, Mengyi Liu, Zhiwu Huang, Shaoxin Li, Hong Chang, Zhen Cui