Create Photo-Realistic Talking Face
Changbo Hu
2001.11.26
* This work was done while visiting Microsoft Research China with Baining Guo and Bo Zhang
Outline
  Introduction to talking faces
  Motivations
  System overview
  Techniques
  Conclusions
Introduction
  What is a talking face?
    Face (lip) animation driven by voice
    Applications
  The process of a talking face
    Face model
    Motion capture
    Mapping between audio and video
    Rendering
  Photo-realistic?
Literature
  Waters, 93, DECface, 2D wireframe model
  Terzopoulos, 95, Skin and muscle model
  Bregler, 97, Video Rewrite, sample-image based
  TS Huang, 98, Mesh model from range data
  Poggio, 98, MikeTalk, viseme morphing
  Guenter, 99, Making Faces, 3D from multiple cameras
  Zhengyou Zhang, 00, 3D face modeling from video through epipolar constraint
  Cosatto, 00, Planar quads model
Some face models
Motivations
  Aim: a graphics interface for a conversational agent
    Photo-realistic
    Driven by Chinese
    Smooth connection between sentences
  Extended from "Video Rewrite"
System overview: Pipeline of the system (1)
  Video with sound -> Images + Sound
  Images -> Pose tracking -> Lip motion tracking
  Sound -> Phoneme segmentation -> Annotation
  Lip motion tracking + Annotation -> Training database
System overview: Pipeline of the system (2)
  New text -> TTS system -> Wav sound -> Segmentation -> Triphone sequence
  Triphone sequence + Training database -> Synthesized triphone sequence -> Lip motion sequence
  Lip motion sequence + Background sequence -> Rewrite to faces
Techniques
  Analysis:
    Audio processing
    Image processing
  Synthesis:
    Lip image
    Background image
    Stitch together
Audio part: Sound Segmentation
  Given the wav file and the script
  Use an HMM to train the segmentation system
  Segment the wav file into a phoneme sequence
  Example of the segmentation result:
    SILOPEN 0 23
    SILOPEN 24 42
    s 43 61
    if4 62 74
    j 75 80
    ia1 81 97
    sh 98 109
    ang1 110 121
    y 122 130
    e4 131 133
    y 134 145
    in2 146 154
    h 155 164
    ang2 165 194
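The segmentation result above can be read as (label, start, end) triples. The following is a minimal sketch, assuming whitespace-separated fields and inclusive start and end indices (the slides do not state the units). `Segment`, `parse_segments`, and `is_contiguous` are hypothetical names introduced here for illustration.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    label: str  # phoneme label, e.g. "ang1" or "SILOPEN"
    start: int  # first index of the segment (assumed inclusive)
    end: int    # last index of the segment (assumed inclusive)


def parse_segments(text: str) -> List[Segment]:
    """Parse lines of the form '<label> <start> <end>' into Segment records."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        label, start, end = fields
        segments.append(Segment(label, int(start), int(end)))
    return segments


def is_contiguous(segments: List[Segment]) -> bool:
    """Check that each segment starts right after the previous one ends."""
    return all(b.start == a.end + 1 for a, b in zip(segments, segments[1:]))


EXAMPLE = """\
SILOPEN 0 23
SILOPEN 24 42
s 43 61
if4 62 74
j 75 80
ia1 81 97
sh 98 109
ang1 110 121
y 122 130
e4 131 133
y 134 145
in2 146 154
h 155 164
ang2 165 194
"""

segments = parse_segments(EXAMPLE)
```

A contiguity check like `is_contiguous(segments)` is a cheap sanity test before the segments are used to annotate video frames.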
Annotation with Phoneme
  Use phonemes to annotate video frames
  Each phoneme in a sentence corresponds to a short span of the video sequence
  Training sentence:
    Audio Frames          Video Frames          Phoneme Sequence
    Frames for Phoneme1   Frames for Phoneme1   Phoneme1
    Frames for Phoneme2   Frames for Phoneme2   Phoneme2
    …                     …                     …
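One way to picture the annotation step is to give every video frame the label of the phoneme segment that covers its start time. This is only a sketch under stated assumptions: the audio frame rate (100 per second) and the video frame rate (25 per second) are illustrative values that do not come from the slides, the demo segments are made up, and `annotate_video_frames` is a hypothetical helper.

```python
from typing import List, Tuple

AUDIO_FRAMES_PER_SEC = 100  # assumption: 10 ms audio frames
VIDEO_FRAMES_PER_SEC = 25   # assumption: 25 fps video


def annotate_video_frames(
    segments: List[Tuple[str, int, int]],
    audio_rate: int = AUDIO_FRAMES_PER_SEC,
    video_rate: int = VIDEO_FRAMES_PER_SEC,
) -> List[str]:
    """Return one phoneme label per video frame.

    Each segment is (label, first_audio_frame, last_audio_frame), inclusive.
    A video frame gets the label of the segment covering its start time.
    """
    total_audio_frames = segments[-1][2] + 1
    n_video = total_audio_frames * video_rate // audio_rate
    labels = []
    seg_idx = 0
    for v in range(n_video):
        # Audio frame index at the moment this video frame starts.
        audio_pos = v * audio_rate // video_rate
        while segments[seg_idx][2] < audio_pos:
            seg_idx += 1
        labels.append(segments[seg_idx][0])
    return labels


# Made-up segments for illustration only.
demo = [("SILOPEN", 0, 23), ("s", 24, 31), ("if4", 32, 39)]
frame_labels = annotate_video_frames(demo)
```

With these assumed rates, each video frame spans four audio frames, so short phonemes may cover only one or two video frames.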
Phoneme Distance Analysis
  Phoneme & triphone basics
  Chinese phonemes vs. English phonemes
  Distance metric definitions
  Results
Phoneme Basics
  Phonemes are the basic elements of speech. All possible speech can be represented by combinations of phonemes.
    CH, JH, S, EH, EY, OY, AE, SIL…
  A triphone is three consecutive phonemes. It represents not only pronunciation characteristics but also context information.
    T-IY-P, IY-P-AA, P-AA-T…
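The triphone examples above follow from sliding a window of three over a phoneme sequence. A minimal sketch; `to_triphones` is a hypothetical helper name, and the hyphen-joined form simply mirrors the notation on the slide.

```python
from typing import List


def to_triphones(phonemes: List[str]) -> List[str]:
    """Slide a window of three over the phoneme sequence."""
    return ["-".join(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]


triphones = to_triphones(["T", "IY", "P", "AA", "T"])
print(triphones)  # ['T-IY-P', 'IY-P-AA', 'P-AA-T']
```

Consecutive triphones share two phonemes, which is what gives each unit its context information.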
Chinese Phoneme vs. English
  Chinese phonemes fall into two basic groups: initials and finals.
    Initials: B, P, M, F, …
    Finals: a3, o1, e2, eng3, iang4, ue5, …
  Each Chinese final has 5 tones: 1, 2, 3, 4, 5.
    Different tones: a1, a2, a3, a4, a5.
  Chinese finals are not actually basic elements of speech.
    For example: iang1, iao1, uang1, iong1…
  The Chinese phoneme set is much larger than the English one.
Phoneme Distance Analysis
  Define the distance between any two phonemes.
  Since we synthesize only video, not sound, tone is ignored.
  Lip shape motion is the core element of the distance metric.
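Ignoring tone amounts to merging finals that differ only in their trailing tone digit. A minimal sketch, assuming tones are written as a final digit 1 to 5 as in the examples on these slides; `strip_tone` is a hypothetical helper.

```python
def strip_tone(phoneme: str) -> str:
    """Drop a trailing tone digit (1-5) from a final, e.g. 'ang1' -> 'ang'."""
    if len(phoneme) > 1 and phoneme[-1] in "12345":
        return phoneme[:-1]
    return phoneme


toneless = [strip_tone(p) for p in ["s", "if4", "j", "ia1", "sh", "ang1"]]
```

After this step, `a1` to `a5` all map to the same entry, which shrinks the phoneme set that the distance matrix has to cover.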
Phoneme Distance Analysis
  For each phoneme (Phoneme 1, Phoneme 2), collect its video clips (Video 1, Video 2, Video 3, Video 4).
  Time-align the clips to a uniform length.
  Average the aligned clips to get an average video per phoneme.
  By comparing the two aligned average videos, we generate the distance matrix of the whole phoneme set.
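The align, average, and compare steps can be sketched as follows. Everything specific here is an assumption made for illustration: each clip is represented as one lip-shape feature vector per frame, time alignment is done by linear interpolation, and two average clips are compared by the mean Euclidean distance between corresponding frames. The slides do not specify these choices, and all function names are hypothetical.

```python
import math
from typing import Dict, List

Clip = List[List[float]]  # one feature vector (e.g. lip point coordinates) per frame


def time_align(clip: Clip, length: int) -> Clip:
    """Resample a clip to `length` frames by linear interpolation."""
    if len(clip) == 1 or length == 1:
        return [list(clip[0]) for _ in range(length)]
    out = []
    for i in range(length):
        pos = i * (len(clip) - 1) / (length - 1)  # fractional source index
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(clip) - 1)
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b for a, b in zip(clip[lo], clip[hi])])
    return out


def average_clip(clips: List[Clip], length: int) -> Clip:
    """Time-align all clips of one phoneme and average them frame by frame."""
    aligned = [time_align(c, length) for c in clips]
    return [
        [sum(vals) / len(aligned) for vals in zip(*(c[t] for c in aligned))]
        for t in range(length)
    ]


def clip_distance(a: Clip, b: Clip) -> float:
    """Mean Euclidean distance between corresponding frames of two aligned clips."""
    per_frame = [
        math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb))) for fa, fb in zip(a, b)
    ]
    return sum(per_frame) / len(per_frame)


def distance_matrix(
    clips_by_phoneme: Dict[str, List[Clip]], length: int = 10
) -> Dict[str, Dict[str, float]]:
    """Distance between the average clips of every pair of phonemes."""
    avg = {p: average_clip(cs, length) for p, cs in clips_by_phoneme.items()}
    return {p: {q: clip_distance(avg[p], avg[q]) for q in avg} for p in avg}


# Toy one-dimensional "lip opening" clips, made up for illustration.
toy = {
    "a": [[[0.0], [2.0]], [[0.0], [1.0], [2.0]]],
    "o": [[[1.0], [3.0]]],
}
matrix = distance_matrix(toy, length=3)
```

The resulting matrix is symmetric with a zero diagonal, so only one triangle needs to be stored in practice.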
Image part: Pose Tracking
  Assume a planar model for the face
  Standard minimization method to find the transform matrix (affine transform) [Black, 95]
  A mask restricts the fit to the part of the face of interest
  Figure: template picture and mask image
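To illustrate the role of the mask, the sketch below scores a candidate alignment using only masked template pixels. It is a deliberately simplified stand-in, not the method cited above: it searches integer translations exhaustively instead of minimizing over a full affine transform, and the images are plain nested lists. `masked_cost` and `track_translation` are hypothetical names.

```python
from typing import List, Tuple

Image = List[List[float]]


def masked_cost(template: Image, mask: Image, frame: Image, dx: int, dy: int) -> float:
    """Mean squared difference over masked template pixels that land inside the frame."""
    total, count = 0.0, 0
    for y, row in enumerate(template):
        for x, value in enumerate(row):
            fy, fx = y + dy, x + dx
            if mask[y][x] and 0 <= fy < len(frame) and 0 <= fx < len(frame[0]):
                total += (frame[fy][fx] - value) ** 2
                count += 1
    return total / count if count else float("inf")


def track_translation(
    template: Image, mask: Image, frame: Image, search: int = 2
) -> Tuple[int, int]:
    """Try every integer shift in [-search, search] and keep the cheapest (dx, dy)."""
    shifts = [
        (dx, dy)
        for dy in range(-search, search + 1)
        for dx in range(-search, search + 1)
    ]
    return min(shifts, key=lambda s: masked_cost(template, mask, frame, s[0], s[1]))


# Toy data: a 3x3 template pasted into a 5x5 frame at offset dx=1, dy=2.
template = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
mask = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
frame = [[0.0] * 5 for _ in range(5)]
for y in range(3):
    for x in range(3):
        frame[y + 2][x + 1] = template[y][x]

best_shift = track_translation(template, mask, frame)
```

Setting mask entries to 0 over the mouth region would keep lip motion from disturbing the pose estimate, which is the purpose of the mask image.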
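Pose tracking
  Motion prediction using parameters with physical meaning

$$
\begin{pmatrix} a_1 & a_2 & a_3 \\ a_4 & a_5 & a_6 \\ 0 & 0 & 1 \end{pmatrix}
=
\begin{pmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{pmatrix}
\cdot
\begin{pmatrix} s_x & k & 0 \\ k & s_y & 0 \\ 0 & 0 & 1 \end{pmatrix}
\cdot
\begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix}
$$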
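The decomposition of the affine matrix into translation, scale/shear, and rotation factors can be sketched in code. The product order (translation, then scale/shear, then rotation) and the sign convention of the rotation matrix are assumptions here, since the slide's matrix layout did not survive extraction. `compose_affine` and `matmul` are hypothetical helpers.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 3x3 matrices."""
    return [
        [sum(a[i][n] * b[n][j] for n in range(3)) for j in range(3)]
        for i in range(3)
    ]


def compose_affine(
    tx: float, ty: float, theta: float, k: float, sx: float, sy: float
) -> Matrix:
    """Build the 3x3 affine matrix from physically meaningful parameters.

    Assumed order: translation * scale_shear * rotation.
    """
    translation = [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]
    scale_shear = [[sx, k, 0.0], [k, sy, 0.0], [0.0, 0.0, 1.0]]
    c, s = math.cos(theta), math.sin(theta)
    rotation = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    return matmul(translation, matmul(scale_shear, rotation))


identity = compose_affine(0.0, 0.0, 0.0, 0.0, 1.0, 1.0)
scaled = compose_affine(3.0, 4.0, 0.0, 0.0, 2.0, 2.0)
```

Predicting motion in terms of `tx`, `ty`, `theta`, `k`, `sx`, and `sy` rather than the raw matrix entries makes it easier to put sensible limits on how far the pose can change between frames.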
Pose Tracking
  Some tracking results:
Lip Motion Tracking
  Using Eigen-Points (Covell, 91)
  Feature points include the jaw, lips, and teeth
  Training database specified manually
  Automatic tracking through all pose-tracked images
Lip Motion Tracking
  Figure: training database (hand-labeled) and auto-tracking results
Synthesizing new sentences
  New text is converted to wav by the TTS system
  The wav is segmented into a phoneme sequence
  Use DP to find an optimal video sequence in the training database
  Time-align the triphone videos and stitch them together
  Transform the lip sequence and paste it onto the background faces
Lip sequence synthesis
  Figure: optimal phoneme sequences (Triphone 1 to Triphone C) selected for the new phoneme sequence
Dynamic Programming
  Figure: graph from Begin to End through candidate nodes Triphone1 to Triphone5
Edge Cost Definition
  Two parts:
    1. Phoneme distance: the distances of the 3 phonemes added together
    2. Lip shape distance for the overlapping portion of the triphone videos
  The two parts are combined as a weighted sum
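The search over candidate triphone clips with this edge cost can be sketched as a shortest-path dynamic program. The structure below is an assumption made for illustration: candidates are identified by strings, the phoneme term is supplied as `node_cost`, the lip-shape term as `join_cost`, and the weights default to 1. None of these names or values come from the slides.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def triphone_distance(
    a: Sequence[str], b: Sequence[str], phoneme_dist: Dict[str, Dict[str, float]]
) -> float:
    """Phoneme part of the cost: the 3 phoneme-to-phoneme distances added together."""
    return sum(phoneme_dist[x][y] for x, y in zip(a, b))


def best_path(
    candidates: Sequence[Sequence[str]],
    node_cost: Callable[[int, str], float],
    join_cost: Callable[[str, str], float],
    w_phoneme: float = 1.0,
    w_lip: float = 1.0,
) -> Tuple[float, List[str]]:
    """Pick one candidate per position so that the summed edge cost is minimal.

    Edge cost into candidate c at position i, coming from candidate p:
        w_phoneme * node_cost(i, c) + w_lip * join_cost(p, c)
    The edge leaving Begin has no join term.
    """
    # best[c] = (cost of the cheapest path ending in c, that path)
    best = {c: (w_phoneme * node_cost(0, c), [c]) for c in candidates[0]}
    for i in range(1, len(candidates)):
        new_best = {}
        for c in candidates[i]:
            cost, path = min(
                (prev_cost + w_lip * join_cost(p, c), prev_path)
                for p, (prev_cost, prev_path) in best.items()
            )
            new_best[c] = (cost + w_phoneme * node_cost(i, c), path + [c])
        best = new_best
    return min(best.values())


# Made-up costs for a two-position example.
NODE = {(0, "a1"): 0.0, (0, "a2"): 1.0, (1, "b1"): 2.0, (1, "b2"): 0.0}
JOIN = {("a1", "b1"): 0.0, ("a1", "b2"): 5.0, ("a2", "b1"): 0.0, ("a2", "b2"): 0.5}

total, path = best_path(
    [["a1", "a2"], ["b1", "b2"]],
    node_cost=lambda i, c: NODE[(i, c)],
    join_cost=lambda p, c: JOIN[(p, c)],
)
```

In this toy case the locally cheapest first clip (`a1`) is not on the best path, because its join to `b2` is expensive; that trade-off is what the weighted sum of the two parts controls.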
Background video generation
  The background is a video sequence of the virtual character speaking something else
  Similarity measurement of background frames
  Select a "standard frame": the frame with the maximal number of frames similar to it
  Filter out frames with jerkiness
$$D(F_1, F_2) = w_1 t_x + w_2 t_y + w_3 \theta + w_4 k + w_5 s_x + w_6 s_y$$
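The standard-frame selection can be sketched with a weighted pose distance. Two points are assumptions not stated on the slide: the distance is taken over absolute differences of the two frames' pose parameters, and "similar" means the distance is at or below a threshold. The weights, the threshold, and the function names are illustrative only.

```python
from typing import List, Sequence

Pose = Sequence[float]  # (tx, ty, theta, k, sx, sy)


def frame_distance(p1: Pose, p2: Pose, weights: Sequence[float]) -> float:
    """Weighted sum of absolute pose-parameter differences (assumed form)."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, p1, p2))


def select_standard_frame(
    poses: List[Pose], weights: Sequence[float], threshold: float
) -> int:
    """Index of the frame that has the most other frames within `threshold` of it."""

    def similar_count(i: int) -> int:
        return sum(
            1
            for j in range(len(poses))
            if j != i and frame_distance(poses[i], poses[j], weights) <= threshold
        )

    return max(range(len(poses)), key=similar_count)


# Made-up poses that differ only in horizontal translation.
poses = [
    (0.0, 0.0, 0.0, 0.0, 1.0, 1.0),
    (1.0, 0.0, 0.0, 0.0, 1.0, 1.0),
    (2.0, 0.0, 0.0, 0.0, 1.0, 1.0),
    (10.0, 0.0, 0.0, 0.0, 1.0, 1.0),
]
weights = (1.0, 1.0, 1.0, 1.0, 1.0, 1.0)
standard = select_standard_frame(poses, weights, threshold=1.0)
```

The same distance, applied to consecutive frames, could flag jerky frames as those whose distance to their neighbor exceeds a limit.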
Stitch the time-aligned result to background faces
  Write back with a mask
  Transform the synthesized lip to the background face
  Figure: mask image for the write-back operation; original background frame; write-back result of the same frame
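The write-back with a mask can be sketched as a per-pixel blend. This assumes the synthesized lip image has already been transformed into the background frame's coordinates and that mask values lie in [0, 1]; images are plain nested lists for illustration, and `write_back` is a hypothetical name.

```python
from typing import List

Image = List[List[float]]


def write_back(background: Image, lip: Image, mask: Image) -> Image:
    """Blend lip over background: out = mask * lip + (1 - mask) * background."""
    return [
        [m * l + (1.0 - m) * b for b, l, m in zip(b_row, l_row, m_row)]
        for b_row, l_row, m_row in zip(background, lip, mask)
    ]


background = [[10.0, 10.0], [10.0, 10.0]]
lip = [[0.0, 20.0], [20.0, 0.0]]
mask = [[1.0, 0.0], [0.5, 0.0]]
result = write_back(background, lip, mask)
```

Fractional mask values near the boundary of the mouth region soften the seam between the pasted lip image and the background face.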
More video results
Conclusion and Future Work
  Pose tracking and lip motion tracking
  Size of the training database
  Talking face with expression
  Real-time generation?
  Fast modeling for different persons
  Animation
Thank you