Video Rewrite: Driving Visual Speech with Audio
![Page 1: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/1.jpg)
Video Rewrite: Driving Visual Speech with Audio
Christoph Bregler
Michele Covell
Malcolm Slaney
Interval Research Corporation
![Page 2: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/2.jpg)
![Page 3: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/3.jpg)
Goal: Photo-realistic Talking Face
Hand-coded 3D Model
OR
Video Rewrite
![Page 4: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/4.jpg)
Facial Animation History:
• Parke (1972)
• Cohen & Massaro, Benoit et al. (1993)
• Waters & Terzopoulos (1990), DEC-Face
• Lewis (1991)
• Litwinowicz & Williams (1994)
• Chen, Graf, Petajan, et al. (1995)
• Scott et al. (1994)
• Ezzat & Poggio (1997)
• Pighin et al. + Guenter et al. (1998)
• Brand (1999)
• Cosatto, Graf (2000)
![Page 5: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/5.jpg)
Video Rewrite: Overview
Analysis
/D/ /IY/ /P/ /AH/
Synthesis
![Page 6: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/6.jpg)
![Page 7: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/7.jpg)
Annotation
• Phonetic
• Head Pose
• Mouth Shape
/D/ /OH/ /N/ /AH/
![Page 8: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/8.jpg)
![Page 9: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/9.jpg)
Phonetic Annotation
HMM Labels: /D/ /IY/ /P/ /AH/
/D-IY-P/ /IY-P-AH/
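The step from the phoneme labels to the overlapping triphone labels on this slide can be sketched as follows. This is a minimal illustration, assuming phonemes are plain strings; the function name `to_triphones` is hypothetical and not from the talk.

```python
def to_triphones(phonemes):
    """Return the overlapping triphone labels for a phoneme sequence.

    Each triphone joins a phoneme with its left and right neighbour,
    so a sequence of n phonemes yields n - 2 labels.
    """
    return ["-".join(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]


print(to_triphones(["D", "IY", "P", "AH"]))  # ['D-IY-P', 'IY-P-AH']
```

Consecutive triphones share two phonemes, which is what later allows neighbouring video clips to overlap.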
![Page 10: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/10.jpg)
Phonetic Annotation
• Acoustic Front-End: RASTA-PLP (Channel Invariant)
• HMM Models / Gaussian Mixture Models (HTK)
• Phoneme Set: 56 categories (CMU)
• Triphone models trained on TIMIT
• Annotation using Forced-Viterbi (and CMU pronunciation dictionary)
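Forced Viterbi alignment can be illustrated with a small dynamic program. The sketch below is a simplification, not the HTK setup named on the slide: it assumes one state per phoneme, no transition probabilities, and that per-frame log-likelihoods are already available as dictionaries. The name `forced_align` and the toy scores are hypothetical.

```python
def forced_align(phones, frame_scores):
    """Assign each frame to one phoneme of a known, ordered phoneme sequence.

    phones:       phoneme labels in spoken order (from transcript + dictionary)
    frame_scores: one dict per frame mapping label -> log-likelihood
    Returns the most likely monotonic labelling, one label per frame.
    """
    n_frames, n_phones = len(frame_scores), len(phones)
    if n_frames < n_phones:
        raise ValueError("need at least one frame per phoneme")
    neg_inf = float("-inf")
    best = [[neg_inf] * n_phones for _ in range(n_frames)]
    back = [[0] * n_phones for _ in range(n_frames)]
    best[0][0] = frame_scores[0][phones[0]]
    for t in range(1, n_frames):
        for j in range(n_phones):
            stay = best[t - 1][j]                          # remain in phoneme j
            move = best[t - 1][j - 1] if j > 0 else neg_inf  # advance from j - 1
            if move > stay:
                best[t][j], back[t][j] = move, j - 1
            else:
                best[t][j], back[t][j] = stay, j
            best[t][j] += frame_scores[t][phones[j]]
    # Trace back from the last phoneme at the last frame.
    j = n_phones - 1
    path = [j]
    for t in range(n_frames - 1, 0, -1):
        j = back[t][j]
        path.append(j)
    path.reverse()
    return [phones[j] for j in path]


# Toy scores (made up for illustration): frame 0 favours /D/, later frames /IY/.
frames = [
    {"D": -1.0, "IY": -5.0},
    {"D": -4.0, "IY": -1.0},
    {"D": -5.0, "IY": -1.0},
]
print(forced_align(["D", "IY"], frames))  # ['D', 'IY', 'IY']
```

Because the phoneme order is fixed by the transcript, the search only decides where each phoneme boundary falls in time.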
![Page 11: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/11.jpg)
![Page 12: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/12.jpg)
![Page 13: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/13.jpg)
Head Pose Annotation
Match planar template
![Page 14: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/14.jpg)
![Page 15: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/15.jpg)
![Page 16: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/16.jpg)
Mouth / Chin Annotation
Eigenpoints
![Page 17: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/17.jpg)
Eigenpoints - Training -
Graylevel + XY control points
![Page 18: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/18.jpg)
Eigenpoints - Mapping -
Graylevel + XY control point space
![Page 19: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/19.jpg)
![Page 20: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/20.jpg)
![Page 21: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/21.jpg)
Video Rewrite: Overview
Analysis
/D/ /IY/ /P/ /AH/
Synthesis
![Page 22: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/22.jpg)
![Page 23: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/23.jpg)
Synthesis - Overview -
background face
![Page 24: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/24.jpg)
![Page 25: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/25.jpg)
Synthesis:
• Transcribe
• Find Lip Clips
• Stitch Together
/J/ /EH/ /L/ /IY/
![Page 26: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/26.jpg)
Matching:
/T/ /AA/ /AA/
![Page 27: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/27.jpg)
Matching: Co-Articulation
/T/ /AA/ /AA/
?
/UW-T-UW/
![Page 28: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/28.jpg)
Matching: Co-Articulation
/UW-T-UW/
/T/ /AA/ /AA/
match /AA-T-AA/
![Page 29: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/29.jpg)
Co-Articulation: Tri-Phones
/AA-S-AA/
/AA-T-AA/
/UW-T-UW/
…
More than 20,000 Tri-Phones in English
![Page 30: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/30.jpg)
Viseme-based Perceptual Match
Owens (1985) Confusion Matrix (rows and columns: P, B, S, T, K, …)
11 Consonant Clusters:
- CH, JH, SH, ZH
- K, G, N, L
- T, D, S, Z
- P, B, M
- F, V
- TH, DH
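A viseme-aware distance between triphones can be sketched from the clusters on this slide. The code below uses only the six clusters that appear in the transcript; the slide mentions 11, and the rest are not reproduced here. The cost values and the names `viseme_class`, `phone_distance`, and `triphone_distance` are assumptions made for illustration.

```python
# Consonant clusters exactly as listed on the slide (only six are shown there).
VISEME_CLUSTERS = [
    {"CH", "JH", "SH", "ZH"},
    {"K", "G", "N", "L"},
    {"T", "D", "S", "Z"},
    {"P", "B", "M"},
    {"F", "V"},
    {"TH", "DH"},
]


def viseme_class(phone):
    """Return the cluster index of a phone, or the phone itself if unlisted."""
    for index, cluster in enumerate(VISEME_CLUSTERS):
        if phone in cluster:
            return index
    return phone  # unlisted phones are treated as their own class


def phone_distance(a, b, same_viseme_cost=0.2):
    """0 for identical phones, a small cost within a cluster, 1 otherwise.

    The value 0.2 is an arbitrary illustrative choice.
    """
    if a == b:
        return 0.0
    if viseme_class(a) == viseme_class(b):
        return same_viseme_cost
    return 1.0


def triphone_distance(x, y):
    """Sum of per-position phone distances between two triphones."""
    return sum(phone_distance(a, b) for a, b in zip(x, y))


target = ("AA", "T", "AA")
print(triphone_distance(target, ("AA", "S", "AA")))  # 0.2
print(triphone_distance(target, ("UW", "T", "UW")))  # 2.0
```

Under these assumed costs, /AA-S-AA/ is a closer stand-in for /AA-T-AA/ than /UW-T-UW/ is, even though only the latter contains the correct center phone.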
![Page 31: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/31.jpg)
McGurk Effect -- Baldy by Cohen & Massaro
![Page 32: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/32.jpg)
Matching: Viseme-Distance
/T/ /AA/ /AA/
Correct phone, wrong context: /UW-T-UW/
Correct viseme, correct context: /AA-S-AA/
![Page 33: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/33.jpg)
Matching: Viseme-Distance
/UW-T-UW/
/T/ /AA/ /AA/
Approximate match: /AA-S-AA/
![Page 34: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/34.jpg)
Matching: Overlapping Triphones
Shape Distance
![Page 35: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/35.jpg)
Matching: Trade-Offs
/T/ /AA/ /AA/ /P/ /IY/
Shape Distance
N-Viseme Distance
Rate of Speech Distance
![Page 36: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/36.jpg)
Matching: N-Best Dynamic Programming
Error = V(t) + R(t) + S(t-1,t)
(Figure: N-best candidates at each time step t)
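The dynamic program over N-best candidates can be sketched as below. The sketch assumes that V(t) and R(t) are per-clip costs (read here as the viseme and rate-of-speech distances) and that S(t-1,t) is a cost between consecutive clips (read here as the shape distance); that reading of the symbols is an assumption, and the function name `best_clip_sequence` is hypothetical.

```python
def best_clip_sequence(unary, pairwise):
    """Pick one candidate per time step so the summed error is minimal.

    unary[t][i]       : V(t) + R(t) for candidate i at step t
    pairwise(t, i, j) : S(t-1, t) between candidate i at t-1 and j at t
    Returns (total_error, list of chosen candidate indices).
    """
    cost = list(unary[0])
    back = []
    for t in range(1, len(unary)):
        new_cost, pointers = [], []
        for j, u in enumerate(unary[t]):
            # Cheapest predecessor for candidate j at step t.
            i_best = min(range(len(cost)),
                         key=lambda i: cost[i] + pairwise(t, i, j))
            new_cost.append(cost[i_best] + pairwise(t, i_best, j) + u)
            pointers.append(i_best)
        cost = new_cost
        back.append(pointers)
    # Trace back from the cheapest final candidate.
    j = min(range(len(cost)), key=cost.__getitem__)
    total = cost[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return total, path


# Toy example: two steps, two candidates each, switching candidates costs 5.
unary = [[0.0, 1.0], [2.0, 0.0]]
switch_cost = lambda t, i, j: 0.0 if i == j else 5.0
print(best_clip_sequence(unary, switch_cost))  # (1.0, [1, 1])
```

In the toy example a greedy choice would start with candidate 0 at the first step; the dynamic program instead accepts a slightly worse first clip to avoid the large transition cost.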
![Page 37: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/37.jpg)
![Page 38: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/38.jpg)
Stitching
![Page 39: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/39.jpg)
![Page 40: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/40.jpg)
Stitching
Morphing
![Page 41: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/41.jpg)
Morphing
Affine-Warp + Beier-Neely
![Page 42: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/42.jpg)
Simple Lighting Correction
Alpha Blending
(Figure: two panels, 1.) and 2.), plotting intensity against X)
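Alpha blending of a mouth patch into the background can be illustrated on a single row of pixel intensities. This is a minimal sketch under stated assumptions: the linear ramp, the `border` parameter, and the function names are hypothetical, and the slide's lighting correction step is not modelled because the transcript does not say how it works.

```python
def ramp_alpha(width, border):
    """Weights that rise linearly from 0 at the patch edges to 1 inside.

    `border` is the number of pixels over which the ramp reaches 1 (> 0).
    """
    return [min(1.0, min(x, width - 1 - x) / border) for x in range(width)]


def alpha_blend(patch, background, alpha):
    """Per-pixel blend: alpha * patch + (1 - alpha) * background."""
    return [a * p + (1.0 - a) * b for p, b, a in zip(patch, background, alpha)]


alpha = ramp_alpha(5, 2)
print(alpha)                                   # [0.0, 0.5, 1.0, 0.5, 0.0]
print(alpha_blend([100] * 5, [0] * 5, alpha))  # [0.0, 50.0, 100.0, 50.0, 0.0]
```

With weights that fall to zero at the patch boundary, the blended row matches the background at the edges, so no hard seam is introduced there.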
![Page 43: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/43.jpg)
![Page 44: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/44.jpg)
Video Rewrite Results
• JFK - Video Model: 2 minutes of data
• Ellen - Video Model: 8 minutes of data
![Page 45: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/45.jpg)
![Page 46: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/46.jpg)
Contributions
• Data-driven lip animation
• Automatic, using vision and speech recognition
• Photo-realistic: implicitly captures specific appearance + dynamics
![Page 47: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/47.jpg)
Video Rewrite
Thanks!
John F. Kennedy
Acknowledgments: S. Ahmad, M. Bajura, F. Crow, T. Darrell, M. Davis, G. Gordon, K. Force, B. Fuson, B. Lassiter, J. Lewis, K. Rahardja, S. Snibbe, C. Sequine, E. Tauber, B. Verplank, S. White, J. Woodfill
![Page 48: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/48.jpg)
1994: Scott et al (JPL + Graphco Technologies)
/o/
/n/
/e/
![Page 49: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/49.jpg)
![Page 50: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/50.jpg)
![Page 51: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/51.jpg)
Matching Video-Snippets with Context
/ AA - S - AA/
/ AA - T - AA/
/ UW - T - UW/
….
“Video Model”
N-phone context
/T/ /AA/ /UW/ /S/
![Page 52: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/52.jpg)
2000: Cosatto, Graf, AT&T Research
![Page 53: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/53.jpg)
![Page 54: Video Rewrite: Driving Visual Speech with Audio](https://reader030.vdocuments.site/reader030/viewer/2022032708/56812e6c550346895d940ea2/html5/thumbnails/54.jpg)
Rewrite Techniques -- Future --
Model Data
Video Rewrite