SPHEAR Workshop 2000
Labeling an Audio-Visual Database and Training an ANN/HMM Audio-Visual Speech Recognition System
Universität Karlsruhe
Martin Heckmann, Kristian Kroschel
Institut de la Communication Parlée
Frédéric Berthommier, Christophe Savariaux
Page 2
Universität Karlsruhe
M. Heckmann, K. Kroschel
Institut de la Communication Parlée (ICP), Grenoble
F. Berthommier, C. Savariaux
Overview
The Database
The System
Multi-Stage Labeling
Results
Outlook
Page 3
Database Acquisition
Transposition of NUMBERS95
Audio-visual repetition of a subset of NB95 (1700 sentences)
Audio and video recordings
Chroma key process to extract lip parameters
Page 4
Lip parameters
outer width
lip surface
inner mouth surface
outer height
inner width
inner height
Page 5
Recognition System
Hybrid ANN/HMM audio-visual system (STRUT)
Separate Identification (SI) structure
SNR given as contextual information
[Block diagram: RASTA-PLP feeds the Audio ANN and Chroma-Key feeds the Video ANN; both ANN outputs and the SNR enter the AV Fusion, which feeds the HMM]
Page 6
Recognition System
Independent weighting of each posterior value for each frame is possible with STRUT
[Figure: network structure. Audio parameters enter with input size 13x3x9, video parameters with input size 6x3x9. Further numeric labels in the figure: 27, 27, 27, 16, 203, 51, 0]
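The input sizes 13x3x9 and 6x3x9 can be read as (parameters per frame) x (streams) x (context frames). This reading is an assumption, for example static, delta and delta-delta streams over a 9-frame window; the slides do not spell it out. A minimal sketch of building such an input, with the hypothetical helper `stack_context` and edge padding chosen only for illustration:

```python
import numpy as np

def stack_context(features: np.ndarray, context: int = 9) -> np.ndarray:
    """Stack `context` consecutive frames around each frame into one vector.

    features: array of shape (num_frames, dim)
    returns:  array of shape (num_frames, dim * context)
    """
    if context % 2 != 1:
        raise ValueError("context must be odd so the window is centred")
    half = context // 2
    # Repeat the first and last frame so every frame has a full window.
    padded = np.pad(features, ((half, half), (0, 0)), mode="edge")
    num_frames = features.shape[0]
    windows = [padded[t:t + context].reshape(-1) for t in range(num_frames)]
    return np.stack(windows)

# Assumed layout: 13 audio parameters x 3 streams, 6 lip parameters x 3 streams.
audio = np.random.default_rng(0).normal(size=(100, 13 * 3))
video = np.random.default_rng(1).normal(size=(100, 6 * 3))
audio_in = stack_context(audio)  # shape (100, 351)
video_in = stack_context(video)  # shape (100, 162)
```

Under this reading the audio ANN would see 351 inputs per frame and the video ANN 162.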
Page 7
Recognition System
Fusion of audio and video via:
$$P(H_i \mid x_A, x_V) \propto P(H_i \mid x_A)^{\lambda} \, P(H_i \mid x_V)^{1-\lambda}$$
Phoneme duration modeled via concatenation of states in the HMM
Dictionary containing 30 words is used
Grammar-free continuous numbers recognition
No distinction between phonemes and visemes is made
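The fusion rule on this slide appears to be a weighted product of the audio and video posteriors, with exponents that sum to one. A minimal sketch under that assumption: `fuse_posteriors` is a hypothetical name, `weight` stands for the exponent on the audio posterior, and the renormalisation is an added step that the slides do not mention.

```python
import numpy as np

def fuse_posteriors(p_audio, p_video, weight):
    """Weighted product of audio and video posteriors, per frame.

    p_audio, p_video: arrays of shape (num_frames, num_classes)
    weight: scalar or array of shape (num_frames,), audio weight in [0, 1]
    """
    p_audio = np.asarray(p_audio, dtype=float)
    p_video = np.asarray(p_video, dtype=float)
    # One weight per frame, shaped (num_frames, 1) so it broadcasts over classes.
    w = np.broadcast_to(np.asarray(weight, dtype=float), p_audio.shape[:1])[:, None]
    eps = 1e-12  # avoid log(0)
    log_fused = w * np.log(p_audio + eps) + (1.0 - w) * np.log(p_video + eps)
    fused = np.exp(log_fused)
    # Renormalise so each frame sums to one (assumption, not from the slides).
    return fused / fused.sum(axis=1, keepdims=True)

# Toy posteriors for two frames and two classes (illustrative numbers only).
p_a = np.array([[0.7, 0.3], [0.7, 0.3]])
p_v = np.array([[0.2, 0.8], [0.2, 0.8]])
fused = fuse_posteriors(p_a, p_v, weight=np.array([0.8, 0.2]))
```

With a high audio weight the first frame follows the audio stream, with a low one the second frame follows the video stream, which is the behaviour a per-frame weighting is meant to allow.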
Page 8
Multi-Stage Labeling
Training on the large multi-speaker database NUMBERS95
WER on NB95: 11.6%
[Figure: flow diagram of the multi-stage labeling. Stages and box labels:
Audio Pretraining on NUMBERS95: Numbers95, ANN Training, ANN Audio Pretrained
Audio Labeling: AV Database Audio, Alignment, ANN Audio Pretrained, AV Database Audio Labeled
Audio Training: AV Database Audio Labeled, ANN Training, ANN Audio Trained
Audio Relabeling: AV Database Audio, Alignment, ANN Audio Retrained, AV Database Audio Relabeled
Video Labeling: AV Database Audio Relabeled, AV Database Video, AV Database Video Labeled]
Page 9
Multi-Stage Labeling
Multi-speaker NUMBERS95
WER on NB95: 11.6%
Single-speaker audio-visual NB95
WER on AVNB95: 28.5%
Forced Viterbi alignment
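Forced Viterbi alignment assigns every frame to one entry of a phoneme sequence that is already known from the transcription. A minimal sketch, assuming a strict left-to-right topology with a single state per phoneme (the slides model duration with several concatenated states) and frame-wise log scores from the ANN; `forced_align` and the toy numbers are illustrative and not taken from STRUT:

```python
import numpy as np

def forced_align(log_probs: np.ndarray, labels: list[int]) -> list[int]:
    """Align a known label sequence to frames with a left-to-right Viterbi pass.

    log_probs: (num_frames, num_classes) frame-wise log scores
    labels:    the label sequence the utterance is known to contain
    returns:   one label per frame, visiting every entry of `labels` in order
    """
    num_frames, num_states = log_probs.shape[0], len(labels)
    if num_states == 0 or num_states > num_frames:
        raise ValueError("need between 1 and num_frames labels")
    score = np.full((num_frames, num_states), -np.inf)
    came_from_prev = np.zeros((num_frames, num_states), dtype=bool)
    score[0, 0] = log_probs[0, labels[0]]
    for t in range(1, num_frames):
        for s in range(num_states):
            stay = score[t - 1, s]
            move = score[t - 1, s - 1] if s > 0 else -np.inf
            came_from_prev[t, s] = move > stay
            score[t, s] = max(stay, move) + log_probs[t, labels[s]]
    # Backtrack from the last state at the last frame.
    path, s = [], num_states - 1
    for t in range(num_frames - 1, -1, -1):
        path.append(labels[s])
        if came_from_prev[t, s]:
            s -= 1
    return path[::-1]

# Five frames, three classes; the utterance is known to be the sequence 0, 2, 1.
probs = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
])
alignment = forced_align(np.log(probs), [0, 2, 1])  # [0, 0, 2, 1, 1]
```

The returned frame labels are what a labeling stage would store as training targets for the next ANN training pass.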
Page 10
Multi-Stage Labeling
WER AVNB95 first labeling: 7.1%
Page 11
Multi-Stage Labeling
WER AVNB95 first labeling: 7.1%
WER AVNB95 second labeling: 4%
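The slides report word error rates without defining the measure. Assuming the usual definition (substitutions, deletions and insertions from a minimum edit-distance alignment, divided by the number of reference words), a minimal sketch looks like this; `word_error_rate` is a hypothetical helper, not part of STRUT:

```python
def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """WER = (substitutions + deletions + insertions) / len(reference)."""
    if not reference:
        raise ValueError("reference must not be empty")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and hypothesis[:j].
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(reference)

ref = "three oh two".split()
hyp = "three two two".split()
wer = word_error_rate(ref, hyp)  # one substitution out of three words
```

Because insertions count as errors, this measure can exceed 100% when the hypothesis is much longer than the reference.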
Page 12
Multi-Stage Labeling
Direct application of the audio labeling to the video path
WER using only video: 35.1%
Page 13
Result of the labeling
Result of audio and video labeling for "three oh two"
[Figure: three panels over a common axis from 0 to 1.5, one of them showing the inner lip height; phoneme labels: #p th r iy #p ow t uw #p]
Page 14
Effects of Audio Pretraining
Comparison between training from scratch on AVNB95 and pretraining on NB95 followed by training continuation on AVNB95
[Figure: word error rate in % (axis 0 to 100) versus SNR (-12 dB, -6 dB, 0 dB, 6 dB, 12 dB, clean). Curves: Audio pretrained, Audio from scratch]
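The comparison is plotted over SNR conditions from -12 dB to clean. The slides do not say how the noisy test material was produced; one common approach, shown here purely as an assumption, scales a noise signal so that the speech-to-noise power ratio hits the target value. `mix_at_snr` is a hypothetical helper:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech`, scaled so the power ratio equals `snr_db`."""
    if speech.shape != noise.shape:
        raise ValueError("speech and noise must have the same shape")
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both have non-zero power")
    # Solve speech_power / (gain**2 * noise_power) = 10 ** (snr_db / 10) for gain.
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

# Synthetic stand-ins for a speech recording and a noise recording.
rng = np.random.default_rng(0)
speech = np.sin(np.linspace(0.0, 200.0, 8000))
noise = rng.normal(size=8000)
noisy = {snr: mix_at_snr(speech, noise, snr) for snr in (-12, -6, 0, 6, 12)}
```

Each entry of `noisy` could then be passed through the audio front end to evaluate one point of the curve.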
Page 15
Audio-Visual Recognition
Comparison of audio, video and audio-visual recognition
[Figure: word error rate in % (axis 0 to 100) versus SNR (-12 dB, -6 dB, 0 dB, 6 dB, 12 dB, clean). Curves: Video, Audio, Audio-Visual]
Page 16
Conclusion
Automatic audio-visual labeling process presented
Pretraining on a large audio database only advantageous for labeling
Good audio-visual recognition scores with hybrid ANN/HMM system
Page 17
Outlook
Improvements of the recognition scores via
  Grouping of visually identical phonemes into visemes
  Introduction of a more elaborate fusion modality
Implementation of a system to determine the weighting of audio and video from the data (e.g. entropy of posteriors) instead of using SNR given as context information
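The last point suggests deriving the stream weight from the posteriors themselves. One possible reading, sketched here as an assumption and not as the authors' method: compute the entropy of each stream's posterior distribution per frame and give the more confident (lower-entropy) stream the larger weight. The mapping in `entropy_audio_weight` is hypothetical.

```python
import numpy as np

def frame_entropy(posteriors: np.ndarray) -> np.ndarray:
    """Shannon entropy (in nats) of each row of a (num_frames, num_classes) array."""
    p = np.clip(posteriors, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def entropy_audio_weight(p_audio: np.ndarray, p_video: np.ndarray) -> np.ndarray:
    """Hypothetical mapping: the stream with lower entropy gets the larger weight."""
    h_a = frame_entropy(p_audio)
    h_v = frame_entropy(p_video)
    total = h_a + h_v
    # Fall back to equal weights when both streams are (numerically) certain.
    safe_total = np.where(total > 1e-9, total, 1.0)
    return np.where(total > 1e-9, h_v / safe_total, 0.5)

# Frame 0: confident audio, flat video. Frame 1: the reverse.
p_a = np.array([[0.98, 0.01, 0.01], [1 / 3, 1 / 3, 1 / 3]])
p_v = np.array([[1 / 3, 1 / 3, 1 / 3], [0.98, 0.01, 0.01]])
audio_weight = entropy_audio_weight(p_a, p_v)
```

The resulting per-frame weight could take the place of the SNR-derived exponent in a weighted-product fusion, so no external SNR estimate would be needed.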