create photo-realistic talking face changbo hu 2001.11.26 * this work was done during visiting...

Create Photo-Realistic Talking Face

Changbo Hu

2001.11.26

* This work was done while visiting Microsoft Research China, with Baining Guo and Bo Zhang.

Outline

Introduction to talking faces
Motivations
System overview
Techniques
Conclusions

Introduction

What is a talking face
  Face (lip) animation, driven by voice
  Applications

The process of talking face
  Face model
  Motion capture
  Mapping between audio and video
  Rendering

Photo-realistic?

Literature

Waters, 93, DECface, 2D wireframe model
Terzopoulos, 95, Skin and muscle model
Bregler, 97, Video Rewrite, sample image based
TS Huang, 98, Mesh model from range data
Poggio, 98, MikeTalk, viseme morphing
Guenter, 99, Making Faces, 3D from multicamera
Zhengyou Zhang, 00, 3D face modeling from video through epipolar constraint
Cosatto, 00, Planar quads model

Some face models

Motivations

Aim: a graphics interface for a conversational agent
  Photo-realistic
  Driven by Chinese
  Smooth connection between sentences

Extended from “Video Rewrite”

System overview: Pipeline of the system (1)

Video with sound
  Images: pose tracking, lip motion tracking
  Sound: phoneme segmentation, annotation
Train database

System overview: Pipeline of the system (2)

New text -> TTS system -> Wav sound -> Segmentation -> Triphone sequence
Triphone sequence + Train database -> Synthesized triphone sequence -> Lip motion sequence
Lip motion sequence + Background sequence -> Rewrite to faces

Techniques

Analysis:
  Audio process
  Image process

Synthesis:
  Lip image
  Background image
  Stitch together

Audio part: Sound Segmentation

Given the wav file and the script
Use an HMM to train the segmentation system
Segment the wav file into a phoneme sequence

Example of the segmentation result:

phoneme   start   end
SILOPEN       0    23
SILOPEN      24    42
s            43    61
if4          62    74
j            75    80
ia1          81    97
sh           98   109
ang1        110   121
y           122   130
e4          131   133
y           134   145
in2         146   154
h           155   164
ang2        165   194
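A minimal sketch of how a segmentation result like the one above could be read into a program. It assumes each entry is a phoneme label followed by a start and an end index; the unit of those indices is not stated on the slide, so they are kept as plain integers. The names `Segment` and `parse_segmentation` are made up for this illustration.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    phoneme: str
    start: int  # unit not given in the source; kept as a raw index
    end: int


def parse_segmentation(text: str) -> List[Segment]:
    """Parse lines of the form '<phoneme> <start> <end>'."""
    segments = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        label, start, end = line.split()
        segments.append(Segment(label, int(start), int(end)))
    return segments


EXAMPLE = """\
SILOPEN 0 23
SILOPEN 24 42
s 43 61
if4 62 74
j 75 80
ia1 81 97
sh 98 109
ang1 110 121
y 122 130
e4 131 133
y 134 145
in2 146 154
h 155 164
ang2 165 194
"""

segments = parse_segmentation(EXAMPLE)
# In this example every segment starts right after the previous one ends.
contiguous = all(b.start == a.end + 1 for a, b in zip(segments, segments[1:]))
```

In this example the segments tile the whole utterance without gaps, which is what makes it possible to label every video frame with a phoneme.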

Annotation with Phoneme

Use phonemes to annotate video frames
Each phoneme in a sentence corresponds to a short stretch of the video sequence

Training sentence:

Audio frames           Video frames           Phoneme sequence
Frames for Phoneme1    Frames for Phoneme1    Phoneme1
Frames for Phoneme2    Frames for Phoneme2    Phoneme2
…                      …                      …
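The annotation step can be sketched as a mapping from each phoneme's audio span to the video frames that overlap it. The slides do not give the audio frame length or the video frame rate, so the values used here (10 ms audio frames, 25 fps video) and the function name `video_frames_for` are assumptions for illustration only.

```python
from typing import List, Tuple

AUDIO_FRAME_MS = 10  # assumed length of one audio frame, not from the source
VIDEO_FPS = 25       # assumed video frame rate, not from the source


def video_frames_for(start: int, end: int,
                     audio_frame_ms: int = AUDIO_FRAME_MS,
                     fps: int = VIDEO_FPS) -> Tuple[int, int]:
    """Return (first, last) video frame indices overlapping audio frames start..end."""
    start_ms = start * audio_frame_ms
    end_ms = (end + 1) * audio_frame_ms            # end index is inclusive
    first = (start_ms * fps) // 1000               # floor
    last = -(-(end_ms * fps) // 1000) - 1          # ceil, then make inclusive
    return first, last


def annotate(segments: List[Tuple[str, int, int]]) -> List[Tuple[str, int, int]]:
    """Attach a video frame range to every (phoneme, start, end) segment."""
    return [(p,) + video_frames_for(s, e) for p, s, e in segments]


annotated = annotate([("s", 43, 61), ("if4", 62, 74)])
```

Under these assumptions a boundary that falls inside a video frame assigns that frame to both neighbouring phonemes, which is one simple way to keep every phoneme non-empty.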

Phoneme Distance Analysis

Phoneme and triphone basics
Chinese phonemes vs. English phonemes
Distance metric definitions
Results

Phoneme Basics

Phonemes represent the basic elements of speech. All possible speech can be represented by combinations of phonemes.
  CH, JH, S, EH, EY, OY, AE, SIL…

Triphones are three consecutive phonemes. A triphone not only represents pronunciation characteristics but also contains context information.
  T-IY-P, IY-P-AA, P-AA-T…
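As a small illustration of the triphone idea, the sketch below slides a window of three over a phoneme sequence. The function name `triphones` is chosen for this example and is not from the original system.

```python
from typing import List, Sequence


def triphones(phonemes: Sequence[str]) -> List[str]:
    """Return every run of three consecutive phonemes, joined with '-'."""
    return ["-".join(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]


result = triphones(["T", "IY", "P", "AA", "T"])
```

Each phoneme except the first and last appears as the centre of exactly one triphone, with its left and right neighbours supplying the context.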

Chinese Phoneme vs. English

Chinese phonemes fall into two basic groups: initials and finals.
  Initials: B, P, M, F, …
  Finals: a3, o1, e2, eng3, iang4, ue5, …

Each Chinese final has 5 tones: 1, 2, 3, 4, 5.
  Different tones: a1, a2, a3, a4, a5.

Chinese finals are actually not basic elements of speech.
  For example: iang1, iao1, uang1, iong1…

The Chinese phoneme set is much larger than the English one.


Define the distance between any two phonemes.

Since we only synthesize video, not sound, tone is ignored.

Lip shape motion is the core element of the distance metric.
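Ignoring tone can be sketched as dropping the trailing tone digit from a final, so that a1 through a5 collapse into one visual class. This assumes tones are always written as a single trailing digit from 1 to 5, as in the examples on these slides; `strip_tone` is a name made up for this illustration.

```python
import re


def strip_tone(phoneme: str) -> str:
    """Remove a trailing tone digit (1-5); initials are returned unchanged."""
    return re.sub(r"[1-5]$", "", phoneme)


toneless = [strip_tone(p) for p in ["a1", "a2", "a3", "a4", "a5", "sh", "iang4"]]
```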


Phoneme 1: example videos (Video 1, Video 2, Video 3, …)
Phoneme 2: example videos (Video 1, Video 2, …)

1. Time-align the videos of each phoneme to a uniform length.
2. Average the aligned videos to get one average video per phoneme.
3. By comparing the two aligned average videos, we generate the distance matrix of the whole phoneme set.
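The align, average, and compare steps can be sketched as follows. The slides do not say how a video is represented or which comparison is used, so this sketch assumes each frame is a vector of lip-shape values, uses linear interpolation for time alignment, and uses the mean per-frame Euclidean distance. All function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Frame = List[float]   # assumed: one lip-shape feature vector per frame
Video = List[Frame]


def resample(video: Video, length: int) -> Video:
    """Time-align a video to `length` frames by linear interpolation."""
    if len(video) == 1 or length == 1:
        return [list(video[0]) for _ in range(length)]
    out = []
    for i in range(length):
        pos = i * (len(video) - 1) / (length - 1)
        lo = int(pos)
        hi = min(lo + 1, len(video) - 1)
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b
                    for a, b in zip(video[lo], video[hi])])
    return out


def average_video(videos: List[Video], length: int) -> Video:
    """Average several examples of one phoneme after aligning them."""
    aligned = [resample(v, length) for v in videos]
    dims = len(aligned[0][0])
    return [[sum(v[t][d] for v in aligned) / len(aligned) for d in range(dims)]
            for t in range(length)]


def video_distance(a: Video, b: Video) -> float:
    """Mean Euclidean distance between corresponding frames."""
    return sum(math.dist(fa, fb) for fa, fb in zip(a, b)) / len(a)


def distance_matrix(examples: Dict[str, List[Video]],
                    length: int = 5) -> Dict[Tuple[str, str], float]:
    """Distance between the average videos of every pair of phonemes."""
    averages = {p: average_video(vs, length) for p, vs in examples.items()}
    return {(p, q): video_distance(averages[p], averages[q])
            for p in averages for q in averages}


toy = {
    "a": [[[0.0], [2.0]], [[0.0], [1.0], [2.0]]],  # two examples, different lengths
    "b": [[[1.0], [3.0]]],
}
matrix = distance_matrix(toy, length=3)
```

The resulting matrix is symmetric with a zero diagonal, so in practice only one triangle needs to be stored.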

Image part: Pose Tracking

Assume a planar model for the face
Standard minimization method to find the transform matrix (affine transform) [Black, 95]
A mask is used to restrict tracking to the parts of the face of interest

Template picture

Mask image

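Pose tracking

Motion prediction using parameters with physical meaning:

\begin{pmatrix} a_0 & a_1 & a_2 \\ a_3 & a_4 & a_5 \\ 0 & 0 & 1 \end{pmatrix}
=
\begin{pmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{pmatrix}
\cdot
\begin{pmatrix} s_x & k & 0 \\ k & s_y & 0 \\ 0 & 0 & 1 \end{pmatrix}
\cdot
\begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix}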
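A minimal sketch of building an affine pose matrix from physically meaningful parameters. It assumes the parameters are a translation (t_x, t_y), scales (s_x, s_y), a skew k, and a rotation angle (called theta here), and that they are composed as translation times scale/skew times rotation; that ordering and the function names are assumptions of this sketch.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 3x3 matrices."""
    return [[sum(a[i][n] * b[n][j] for n in range(3)) for j in range(3)]
            for i in range(3)]


def pose_matrix(tx: float, ty: float, sx: float, sy: float,
                k: float, theta: float) -> Matrix:
    """Compose translation * scale/skew * rotation (assumed order)."""
    translation = [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]
    scale_skew = [[sx, k, 0.0], [k, sy, 0.0], [0.0, 0.0, 1.0]]
    c, s = math.cos(theta), math.sin(theta)
    rotation = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    return matmul(matmul(translation, scale_skew), rotation)


# Pure translation: the upper-left 2x2 block stays the identity.
shifted = pose_matrix(tx=2.0, ty=3.0, sx=1.0, sy=1.0, k=0.0, theta=0.0)
```

Predicting motion in terms of these six parameters, rather than the raw affine entries, makes it easier to bound each one by what a head can plausibly do between frames.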

Pose Tracking

Some tracking results:

Lip Motion Tracking

Using Eigen Points (Covell, 91)
Feature points include jaw, lips, and teeth
Training database specified manually
Automatic tracking through all pose-tracked images

Lip Motion Tracking

Train database (hand-labeled)

Auto tracking results

Synthesizing new sentences

New text is converted by the TTS system to wav
The wav is segmented into a phoneme sequence
Use DP (dynamic programming) to find an optimal video sequence from the training database
Time-align the triphone videos and stitch them together
Transform the lip sequence and paste it onto the background faces

Lip sequence synthesis

[Diagram: the new phoneme sequence is matched against triphones from the database (Triphone 1 through Triphone C) to produce the optimal phoneme sequence.]

Dynamic Programming

[Graph: Begin -> candidate nodes Triphone1, Triphone2, Triphone3, Triphone4, Triphone5 -> End]

Edge Cost Definition

Two parts:
1. Phoneme distance: the distances of the 3 phonemes added together
2. Lip shape distance for the overlapping portion of the triphone videos

The two parts are combined as a weighted sum.
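The search over candidate triphones with the two-part edge cost above can be sketched as a shortest-path dynamic program. This is an illustrative sketch, not the original implementation: the weights `w_phoneme` and `w_lip`, the callback signatures, and the toy data are all assumptions.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def triphone_distance(target: Sequence[str], candidate: Sequence[str],
                      phoneme_dist: Callable[[str, str], float]) -> float:
    """Part 1: the three phoneme distances added together."""
    return sum(phoneme_dist(a, b) for a, b in zip(target, candidate))


def best_sequence(candidates: List[List[str]],
                  match_cost: Callable[[int, str], float],
                  lip_cost: Callable[[str, str], float],
                  w_phoneme: float = 1.0,
                  w_lip: float = 1.0) -> Tuple[List[str], float]:
    """Pick one candidate per position, minimizing the total edge cost.

    match_cost(i, c): triphone distance between target i and candidate c.
    lip_cost(p, c):   lip-shape distance over the overlap of p and c.
    """
    # best[i][c] = (cheapest cost of a path ending in c, predecessor of c)
    best: List[Dict[str, Tuple[float, str]]] = [
        {c: (w_phoneme * match_cost(0, c), "") for c in candidates[0]}
    ]
    for i in range(1, len(candidates)):
        layer = {}
        for c in candidates[i]:
            cost, prev = min((best[i - 1][p][0] + w_lip * lip_cost(p, c), p)
                             for p in candidates[i - 1])
            layer[c] = (cost + w_phoneme * match_cost(i, c), prev)
        best.append(layer)
    last = min(best[-1], key=lambda c: best[-1][c][0])
    total = best[-1][last][0]
    path = [last]
    for i in range(len(candidates) - 1, 0, -1):
        path.append(best[i][path[-1]][1])  # follow predecessors back to Begin
    path.reverse()
    return path, total


# Toy data (hypothetical): two target triphones, two database candidates each.
targets = [("s", "i", "l"), ("i", "l", "a")]
clips = {"A": ("s", "i", "l"), "B": ("s", "e", "l"),
         "C": ("i", "l", "a"), "D": ("i", "l", "o")}
joins = {("A", "C"): 5.0, ("A", "D"): 0.5, ("B", "C"): 0.0, ("B", "D"): 0.0}

path, total = best_sequence(
    [["A", "B"], ["C", "D"]],
    match_cost=lambda i, c: triphone_distance(
        targets[i], clips[c], lambda a, b: 0.0 if a == b else 1.0),
    lip_cost=lambda p, c: joins[(p, c)],
)
```

In the toy data the exact phoneme match A is rejected in favour of B because A joins badly with C, which shows how the lip-shape term trades phonetic accuracy for a smooth transition.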

Background video generation

The background is a video sequence recorded while the virtual character was saying something else

Similarity measurement of background frames:

D(F_1, F_2) = w_1 t_x + w_2 t_y + w_3 \theta + w_4 k + w_5 s_x + w_6 s_y

Select a “standard frame”: the frame with the maximal number of frames similar to it

Filter out the frames with jerkiness
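Selecting the standard frame can be sketched as counting, for every frame, how many other frames lie within a similarity threshold. The sketch reads the frame distance as a weighted sum over the six pose parameters and assumes it is applied to absolute differences between the two frames' parameters; the threshold, the weights, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed pose layout: (t_x, t_y, theta, k, s_x, s_y)
Pose = Tuple[float, float, float, float, float, float]


def frame_distance(p1: Pose, p2: Pose, weights: Sequence[float]) -> float:
    """Weighted sum of absolute pose-parameter differences (an assumption)."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, p1, p2))


def standard_frame(poses: List[Pose], weights: Sequence[float],
                   threshold: float) -> int:
    """Index of the frame that has the most other frames within `threshold`."""
    best_index, best_count = 0, -1
    for i, pi in enumerate(poses):
        count = sum(1 for j, pj in enumerate(poses)
                    if j != i and frame_distance(pi, pj, weights) <= threshold)
        if count > best_count:
            best_index, best_count = i, count
    return best_index


poses = [
    (0.0, 0.0, 0.0, 0.0, 1.0, 1.0),
    (0.1, 0.0, 0.0, 0.0, 1.0, 1.0),
    (0.2, 0.0, 0.0, 0.0, 1.0, 1.0),
    (5.0, 5.0, 0.0, 0.0, 1.0, 1.0),  # an outlier pose
]
chosen = standard_frame(poses, weights=[1.0] * 6, threshold=0.15)
```

The same distance could also flag jerky frames, for instance by dropping frames whose distance to their neighbours exceeds a limit, though the slides do not specify that rule.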

Stitch the time-aligned result to background faces

Write back with a mask
Transform the synthesized lip to the background face

Mask image for write-back operation

Original background frame

Write-back result of the same frame
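The masked write-back can be sketched as a per-pixel blend between the transformed lip image and the background frame. This assumes grayscale images stored as nested lists of equal size and a mask with values in [0, 1], where 1 means "take the lip pixel"; the slides do not say whether the original mask was binary or soft.

```python
from typing import List

Image = List[List[float]]


def write_back(background: Image, lip: Image, mask: Image) -> Image:
    """Blend lip over background: out = mask * lip + (1 - mask) * background."""
    return [[m * l + (1.0 - m) * b
             for b, l, m in zip(bg_row, lip_row, mask_row)]
            for bg_row, lip_row, mask_row in zip(background, lip, mask)]


background = [[10.0, 10.0], [10.0, 10.0]]
lip = [[50.0, 50.0], [50.0, 50.0]]
mask = [[1.0, 0.0], [0.5, 0.0]]
result = write_back(background, lip, mask)
```

A soft mask edge, with values strictly between 0 and 1, hides the seam between the pasted mouth region and the rest of the face.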
