Frederico Rodrigues and Isabel Trancoso
INESC/IST, 2000
Robust Recognition of Digits and Natural Numbers
2
Summary
Problem overview
Baseline system
Extensions to the baseline system
Conclusions and future work
3
The Problem
Speaker
Gender
Age
Vocal tract characteristics
Pronunciation
Rate of Speech
Stress
Lombard Reflex
Microphone
Position
Distortion
Channel
Distortion
Noise
Environment
Background noises
Intermittent noises
Cocktail party noises
Reverberation
4
Corpus Description
Multilingual telephone speech corpus
SPEECHDAT(M) 1000 speakers
SPEECHDAT(II) 4000 speakers
Orthographically transcribed, including noise events
5
Noise events
[spk] : Speaker related noises
[sta] : Stationary noises
[int] : Intermittent noises
SPEECHDAT(II) SPEECHDAT(M)
[spk] Blow, loud breath, other speaker noises
[sta] channel noise, background noise
[int] cross talk, radio, telephone, other
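Since the noise marks are part of the orthographic transcription, they can be used to filter or clean utterances. A minimal sketch, assuming the marks appear inline as bracketed tokens exactly as written above (the actual label-file format is not given in these slides, and the function names are hypothetical):

```python
import re

# The three noise marks listed above; the inline bracketed form is an assumption.
NOISE_MARK = re.compile(r"\[(spk|sta|int)\]")

def noise_marks(transcription):
    """Return the noise marks found in a transcription, in order."""
    return NOISE_MARK.findall(transcription)

def strip_marks(transcription):
    """Remove noise marks and collapse the leftover whitespace."""
    return " ".join(NOISE_MARK.sub(" ", transcription).split())

def is_clean(transcription):
    """True when the transcription carries no noise mark at all."""
    return not noise_marks(transcription)
```

For a made-up transcription such as "[sta] três [spk] seis", noise_marks returns the marks sta and spk, and strip_marks leaves only the words.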
We will now ask you to read the right-hand column of the following list:
1. Read the number digit by digit: 3 6 4 8 2
2. Read the sentence: A derrota veio num golo que teve um remate muito bonito. (The defeat came from a goal that had a very fine shot.)
3. Read the name of the city or town: Edimburgo (Edinburgh)
4. Spell the word (letter by letter): E, D, I, M, B, U, R, G, O
5. Read the sentence: Pincele tudo com uma gema de ovo misturada com uma colher de sopa de água. (Brush everything with an egg yolk mixed with a tablespoon of water.)
6. Read the time: onze horas e cinco minutos (five past eleven)
7. Read the word: operador (operator)
8. Read the amount of money: 18.362$00
10. Read the credit card number: 3483 1331 0764 7082
11. Read the sentence: Eu queria telefonar (I would like to make a phone call)
12. Read the number in full: 19.395
13. Read the sentence: O deputado participou, em sessenta e um, na Operação Dulcinea, conduzida por Henrique Galvão. (The deputy took part, in sixty-one, in Operation Dulcinea, led by Henrique Galvão.)
14. Read the date: Domingo, 21 de Maio de 2000 (Sunday, 21 May 2000)
27. Read the number: zero
28. Read the word: conferência (conference)
29. Read the sentence: O estado apostou sem risco e embolsou mais de dez milhões de contos (The state bet without risk and pocketed more than ten million contos)
30. Read the personal code: 1 4 1 4 2 0
31. Read the word: sopro (blow)
32. Read the number digit by digit: 9 0 5 2 7 3 1 8 4 6
7
Train and Test Set Definition
Selection procedure
– Age, gender and region distribution are approximately equal in both train and test sets
SPEECHDAT(II)
– Fixed 500-speaker evaluation set
– Additional 300-speaker development set
SPEECHDAT(M)
– 200-speaker evaluation set
Overall ratio of 80% Train/20% Test
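The slides do not describe how the selection was actually carried out. One way to keep age, gender and region distributions roughly equal at an 80/20 ratio is a stratified split over speakers. The sketch below assumes each speaker is a dict with hypothetical "age", "gender" and "region" fields:

```python
import random
from collections import defaultdict

def stratified_split(speakers, test_ratio=0.2, seed=0):
    """Split speakers so every (age, gender, region) group keeps
    roughly the same share of test speakers."""
    groups = defaultdict(list)
    for spk in speakers:
        groups[(spk["age"], spk["gender"], spk["region"])].append(spk)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for key in sorted(groups):
        members = groups[key]
        rng.shuffle(members)
        n_test = round(len(members) * test_ratio)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test
```

Splitting by speaker rather than by utterance keeps every speaker entirely in one set, which matches the speaker counts quoted for the evaluation and development sets.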
8
Sub-corpus Used
I1 - Isolated digit strings
B1 - Sequences of 10 digits
N* - Natural numbers
                 I1    B1    N*
Train Set        2954  2905  5059
Evaluation Set   768   491   260
Development Set  -     277   467
Total            3722  3673  5786
9
Feature Extraction
MFCC (Mel Frequency Cepstral Coefficients)
– 14 Cepstra + 14 ΔCepstra + Energy + ΔEnergy
– Speech signal band-limited between 200 and 3800 Hz
– Hamming window: 25 ms, every 10 ms
Cepstral Mean Subtraction
– Simple but effective technique for channel and speaker normalization
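A minimal sketch of the two numeric steps above: counting 25 ms analysis windows taken every 10 ms, and cepstral mean subtraction applied per utterance. The 8 kHz sampling rate is an assumption (typical for telephone speech and consistent with the 3800 Hz band limit); the function names are illustrative.

```python
def frame_count(n_samples, sample_rate=8000, win_ms=25, shift_ms=10):
    """Number of full analysis windows that fit in the signal."""
    win = sample_rate * win_ms // 1000      # 200 samples at 8 kHz
    shift = sample_rate * shift_ms // 1000  # 80 samples at 8 kHz
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // shift

def cepstral_mean_subtraction(frames):
    """Subtract the utterance-level mean of each coefficient from every frame.

    `frames` is a list of equal-length feature vectors for one utterance.
    """
    n = len(frames)
    dims = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [[f[d] - mean[d] for d in range(dims)] for f in frames]
```

A fixed channel acts roughly as a constant offset in the cepstral domain, so removing the per-utterance mean cancels much of it.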
10
Acoustic Modeling
Left-right continuous density HMMs
– Word models for each digit. No skips.
– Silence and filler models with forward and backward skips
Gender-dependent models
HMM: Hidden Markov Model
11
Model Topology
Fillers and silence models topology
No. of states  Models
3              “um”, fillers, silence
6              “zero”, “três”, “quatro”, “cinco”, “oito”, “nove”
7              “sete”
8              “dois”, “seis”
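To make the "left-right, no skips" topology of the digit models concrete, here is a sketch that builds a transition matrix for a given number of states. The state counts are the ones listed above. The initial self-loop probability of 0.6 is an arbitrary illustrative value, not taken from the slides, and the filler and silence models (which allow skips) are not covered.

```python
# Number of emitting states per digit word model, from the table above.
STATES = {
    "um": 3,
    "zero": 6, "três": 6, "quatro": 6, "cinco": 6, "oito": 6, "nove": 6,
    "sete": 7,
    "dois": 8, "seis": 8,
}

def left_right_transitions(n_states, self_loop=0.6):
    """Transition matrix for a left-right HMM with no skips.

    Row i holds the probabilities of leaving state i. A state may only
    stay where it is or move to the next state; the extra last column
    is the probability of exiting the model.
    """
    matrix = []
    for i in range(n_states):
        row = [0.0] * (n_states + 1)
        row[i] = self_loop
        row[i + 1] = 1.0 - self_loop
        matrix.append(row)
    return matrix
```

For example, left_right_transitions(STATES["um"]) gives a 3 x 4 matrix in which state 0 cannot reach state 2 directly.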
12
Baseline System - Isolated Digits
Choose isolated digits with no noise marks
– HMM parameters initialized with the global mean and variance of the training data
Embedded Baum-Welch reestimation
Evaluate performance with Viterbi decoding
– Grammar allowing one digit and initial and final silence
– Grammar allowing one digit and any number of fillers or silence
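The initialization step above (often called a flat start) can be sketched as follows: compute one global mean and variance over all training frames and copy them into every state of every model before Baum-Welch reestimation. The data layout (lists of frames, diagonal variances) is an assumption made for illustration.

```python
def global_mean_variance(utterances):
    """Per-dimension mean and variance over all frames of all utterances."""
    frames = [frame for utt in utterances for frame in utt]
    n, dims = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dims)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dims)]
    return mean, var

def flat_start(n_states, mean, var):
    """Give every state its own copy of the same single Gaussian."""
    return [{"mean": list(mean), "var": list(var)} for _ in range(n_states)]
```

All states start out identical; the reestimation passes are what make them diverge to fit different parts of each word.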
13
Baseline System - Isolated Digits
[Figure: the two evaluation grammars. One allows filler models and silence before and after a single digit ("Zero", "Um", ..., "Oito", "Nove"); the other allows only silence before and after a single digit.]
14
Baseline System - Isolated Digits
Increment Gaussian mixtures per state up to 3 for the digit models
Introduce files with noise marks
Repeat re-estimation/evaluation process
Increment Gaussian mixtures per state up to 3 for the filler and digit models
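The slides do not say how the number of mixtures is incremented. One common heuristic, similar to the mixture splitting offered by toolkits such as HTK, is to split the heaviest Gaussian in a state into two copies with halved weights and slightly perturbed means, then reestimate. The sketch below assumes that heuristic, diagonal covariances, and a hypothetical dict layout:

```python
def split_heaviest(mixture, offset=0.2):
    """Add one component by splitting the heaviest Gaussian in a state.

    `mixture` is a list of dicts with "weight", "mean" and "var" (diagonal).
    The two children share the parent's weight, keep its variance, and have
    their means moved by +/- offset standard deviations.
    """
    idx = max(range(len(mixture)), key=lambda i: mixture[i]["weight"])
    parent = mixture[idx]
    std = [v ** 0.5 for v in parent["var"]]
    children = []
    for sign in (1, -1):
        children.append({
            "weight": parent["weight"] / 2,
            "mean": [m + sign * offset * s for m, s in zip(parent["mean"], std)],
            "var": list(parent["var"]),
        })
    return mixture[:idx] + children + mixture[idx + 1:]
```

Calling it twice on a single-Gaussian state yields three components, which would then go through the re-estimation/evaluation loop described above.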
15
Connected vs Isolated Digits
Example:
Number 3 1 2 6 said as:
Isolated Digits: t r e S u~ d o j S s 6 j S
Connected Digits: t r e z u~ d o j S _ 6 j S
16
Baseline System - Connected Digits
Use best isolated digit models as bootstrap models
Repeat re-estimation/evaluation process
Gradually increment Gaussian mixtures per state up to 5 for the digit models
17
Baseline System - Results
                                          % Correctness  % Accuracy
Isolated Digits                           99.0           99.0
Connected Digits, known-length grammar    97.5           97.3
Connected Digits, unknown-length grammar  96.2           96.1
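The slides do not define the two measures. Assuming the usual definitions from the speech recognition literature, correctness ignores insertion errors while accuracy penalizes them, which is why accuracy is never higher than correctness in the table above. A sketch, with N the number of reference words and D, S, I the deletions, substitutions and insertions found by aligning the recognizer output with the reference:

```python
def correctness(n, deletions, substitutions):
    """Percentage of reference words recognized correctly (insertions ignored)."""
    return 100.0 * (n - deletions - substitutions) / n

def accuracy(n, deletions, substitutions, insertions):
    """Like correctness, but insertions also count as errors."""
    return 100.0 * (n - deletions - substitutions - insertions) / n
```

With made-up counts of N = 1000, D = 10, S = 15 and I = 2, correctness is 97.5 and accuracy is 97.3. These counts are purely illustrative and are not the ones behind the reported results.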
18
Extension to the Baseline System
New way of modelling the filler models
Same training/evaluation process
Train the 9 filler and silence models with no skips
Build a single filler model by concatenating all filler and silence models
19
New Filler Model Architecture
[Figure: new filler model built from Silence, Filler 1, ..., Filler 8, Filler 9]
20
Results With New Filler Model
                                          % Correctness  % Accuracy
Isolated Digits                           99.5           99.5
Connected Digits, known-length grammar    98.5           98.2
Connected Digits, unknown-length grammar  97.8           96.4

                                          % Correctness    % Accuracy
                                          BS     EXT       BS     EXT
Isolated Digits                           99.0   99.4      99.0   99.4
Connected Digits, known-length grammar    97.8   98.1      97.5   98.0
Connected Digits, unknown-length grammar  97.2   97.9      95.1   96.1
(BS: baseline system; EXT: extended system)
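Gains this close to 100% are easier to judge as a relative reduction in error rate. A small sketch, taking the isolated-digit correctness of 99.0 from the baseline results and 99.5 from the first table for the new filler model:

```python
def relative_error_reduction(baseline_pct, improved_pct):
    """Relative drop in error rate (in %) between two correctness scores."""
    base_err = 100.0 - baseline_pct
    new_err = 100.0 - improved_pct
    return 100.0 * (base_err - new_err) / base_err

# Isolated digits: the error rate goes from 1.0% to 0.5%.
isolated = relative_error_reduction(99.0, 99.5)
```

Here isolated evaluates to 50.0, that is, the error rate is halved. The same function can be applied to the connected-digit rows.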
21
Natural Numbers
Phone models with 3 states and no skips
• Larger vocabulary size
• May be adapted to other tasks
Phones initialized from models already trained for a directory assistance task
Digits are still modeled by word models
Grammar for natural numbers ranging from zero to hundreds of millions
22
Natural Numbers Example
Number 25:
Hypothesis 1: vinte e cinco (Twenty and five)
Hypothesis 2: vinte cinco (Twenty five)
But “vinte cinco” could also be the sequence of natural numbers: 20 5
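The ambiguity can be reproduced with a toy grammar. The sketch below is not the grammar used in the system: it covers only the units 1 to 9 and the tens 20 to 90, treats the connective "e" as optional, and lists every way a word sequence can be read as a sequence of numbers.

```python
UNITS = {"um": 1, "dois": 2, "três": 3, "quatro": 4, "cinco": 5,
         "seis": 6, "sete": 7, "oito": 8, "nove": 9}
TENS = {"vinte": 20, "trinta": 30, "quarenta": 40, "cinquenta": 50,
        "sessenta": 60, "setenta": 70, "oitenta": 80, "noventa": 90}

def parses(words):
    """Return every reading of `words` as a list of numbers below 100."""
    if not words:
        return [[]]
    results = []
    head, rest = words[0], words[1:]
    if head in UNITS:
        results += [[UNITS[head]] + tail for tail in parses(rest)]
    if head in TENS:
        # Reading 1: the tens word is a number on its own.
        results += [[TENS[head]] + tail for tail in parses(rest)]
        # Reading 2: tens word, optional "e", then a unit form one number.
        after = rest[1:] if rest[:1] == ["e"] else rest
        if after and after[0] in UNITS:
            value = TENS[head] + UNITS[after[0]]
            results += [[value] + tail for tail in parses(after[1:])]
    return results
```

With this toy grammar, "vinte e cinco" has the single reading 25, while "vinte cinco" can be read either as 25 or as the sequence 20, 5.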
23
Natural Numbers - Results
# Mixtures  % Correctness  % Accuracy
1           90.9           90.2
2           95.5           95.0
5           98.5           98.4
24
Sample Application
[Figure: client-server block diagram with the components State Control, Speech Recording, Feature Extraction, Speech Recognition and Speech Synthesis (DIXI - SVIT). User, Client and Server exchange Speech, Prompts, Speech / Commands, Synthesised answer / Commands, and Answer.]
25
Conclusions and Future Work
Explicitly modeling fillers is a difficult task
– Improved filler model decreases the error rate by up to 50%
Develop context-dependent models
– Solve vowel reduction and co-articulation problems
Results may be improved through the use of discriminative training techniques